Matthew Honnibal
301f3cc898
Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
2016-10-27 18:01:55 +02:00
Matthew Honnibal
03a520ec4f
Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.
2016-10-27 17:58:56 +02:00
Matthew Honnibal
a209b10579
Improve error message when oracle fails for non-projective trees, re Issue #571.
2016-10-24 20:31:30 +02:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
2016-10-23 17:45:44 +02:00
Matthew Honnibal
59038f7efa
Restore support for prior data format -- specifically, the labels field of the config.
2016-10-17 00:53:26 +02:00
Matthew Honnibal
7887ab3b36
Fix default use of feature_templates in parser
2016-10-16 21:41:56 +02:00
Matthew Honnibal
f787cd29fe
Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.
2016-10-16 21:34:57 +02:00
Matthew Honnibal
274a4d4272
Fix queue Python property in StateClass
2016-10-16 17:04:41 +02:00
Matthew Honnibal
e8c8aa08ce
Make action_name optional in StepwiseState
2016-10-16 17:04:16 +02:00
Matthew Honnibal
4fc56d4a31
Rename 'labels' to 'actions' in parser options
2016-10-16 11:42:26 +02:00
Matthew Honnibal
3259a63779
Whitespace
2016-10-16 01:47:28 +02:00
Matthew Honnibal
d9ae2d68af
Load features by string-name for backwards compatibility.
2016-10-12 20:15:11 +02:00
Matthew Honnibal
3a03c668c3
Fix message in ParserStateError
2016-10-12 14:44:31 +02:00
Matthew Honnibal
6bf505e865
Fix error on ParserStateError
2016-10-12 14:35:55 +02:00
Matthew Honnibal
ea23b64cc8
Refactor training, with new spacy.train module. Defaults still a little awkward.
2016-10-09 12:24:24 +02:00
Matthew Honnibal
1d70db58aa
Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal
9e09b39b9f
Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
e3285f6f30
Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal
78f19baafa
Fix report of ParserStateError
2016-09-30 19:59:22 +02:00
Matthew Honnibal
0442e0ab1e
Changes to transition systems for new StringStore scheme
2016-09-30 19:58:51 +02:00
Matthew Honnibal
4f794b215a
Changes to iterators.pyx for new StringStore scheme
2016-09-30 19:57:49 +02:00
Matthew Honnibal
4cbf0d3bb6
Handle errors when no valid actions are available, pointing users to the issue tracker.
2016-09-27 19:19:53 +02:00
Matthew Honnibal
430473bd98
Raise errors when no actions are available, re Issue #429
2016-09-27 19:09:37 +02:00
Matthew Honnibal
8e7df3c4ca
Expect the parser data, if parser.load() is called.
2016-09-27 14:02:12 +02:00
Matthew Honnibal
a44763af0e
Fix Issue #469: Incorrectly cased root label in noun chunk iterator
2016-09-27 13:13:01 +02:00
Matthew Honnibal
e07b9665f7
Don't expect parser model
2016-09-26 18:09:33 +02:00
Matthew Honnibal
ee6fa106da
Fix parser features
2016-09-26 17:57:32 +02:00
Matthew Honnibal
e607e4b598
Fix parser loading
2016-09-26 17:51:11 +02:00
Matthew Honnibal
2debc4e0a2
Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.
2016-09-26 11:57:54 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
60fdf4d5f1
Remove commented-out debugging code
2016-09-24 01:17:18 +02:00
Matthew Honnibal
070af4af9d
Revert "* Working neural net, but features hacky. Switching to extractor."
This reverts commit 7c2f1a673b.
2016-09-21 12:26:14 +02:00
Matthew Honnibal
7c2f1a673b
* Working neural net, but features hacky. Switching to extractor.
2016-05-26 19:06:10 +02:00
Matthew Honnibal
13fad36e49
* Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop
2016-05-20 10:11:05 +02:00
Wolfgang Seeker
7b78239436
add fix for German noun chunk iterator (issue #365)
2016-05-06 01:41:26 +02:00
Matthew Honnibal
bb94022975
* Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags.
2016-05-06 00:21:05 +02:00
Wolfgang Seeker
dbf8f5f3ec
fix bug in StateC.set_break()
2016-05-05 15:15:34 +02:00
Wolfgang Seeker
3c44b5dc1a
call deprojectivization after parsing
2016-05-05 15:10:36 +02:00
Matthew Honnibal
472f576b82
* Deprojectivize German parses
2016-05-05 15:01:10 +02:00
Wolfgang Seeker
e4ea2bea01
fix whitespace
2016-05-04 07:40:38 +02:00
Wolfgang Seeker
5bf2fd1f78
make the code less cryptic
2016-05-03 17:19:05 +02:00
Wolfgang Seeker
a06fca9fdf
German noun chunk iterator now doesn't return tokens more than once
2016-05-03 16:58:59 +02:00
Wolfgang Seeker
7b246c13cb
reformulate noun chunk tests for English
2016-05-03 14:24:35 +02:00
Matthew Honnibal
1f1532142f
* Fix cost calculation on non-monotonic oracle
2016-05-03 00:21:08 +02:00
Matthew Honnibal
508fd1f6dc
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
2016-05-02 14:25:10 +02:00
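The iterator contract described in commit 508fd1f6dc can be sketched roughly as follows. This is a toy illustration, not spaCy's implementation: `ToyDoc`, `toy_noun_chunks`, `NOMINAL_TAGS` and `NP_LABEL` are hypothetical names, and the chunking rule (runs of nominal tags) is a stand-in for the real dependency-based logic. The only points taken from the commit message are that the iterator is a plain function, that it yields (start, end, label) integer triples, and that the attribute on the document stays writable.

```python
NOMINAL_TAGS = {"DET", "ADJ", "NOUN"}  # hypothetical tag set for this sketch
NP_LABEL = 1  # hypothetical integer ID standing in for the chunk label


def toy_noun_chunks(doc):
    """Yield (start, end, label) triples for runs of nominal tags."""
    start = None
    for i, tag in enumerate(doc.tags):
        if tag in NOMINAL_TAGS:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i, NP_LABEL)
            start = None
    if start is not None:
        yield (start, len(doc.tags), NP_LABEL)


class ToyDoc:
    """Hypothetical stand-in for a document object."""

    def __init__(self, tags):
        self.tags = list(tags)
        # Installed at creation time, but users may overwrite it.
        self.noun_chunk_iterator = toy_noun_chunks

    @property
    def noun_chunks(self):
        return list(self.noun_chunk_iterator(self))
```

Because the iterator is an ordinary attribute, swapping in a different chunker is a single assignment, for example `doc.noun_chunk_iterator = my_chunker`.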
Matthew Honnibal
77609588b6
* Fix assignment of root label to words left as root implicitly, after parsing ends.
2016-04-25 19:41:59 +00:00
Matthew Honnibal
7c2d2deaa7
* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322
2016-04-25 19:41:59 +00:00
Wolfgang Seeker
12024b0b0a
bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Wolfgang Seeker
b98cc3266d
bugfix: iterators now reset properly when called a second time
2016-04-15 17:49:16 +02:00
Wolfgang Seeker
289b10f441
remove some comments
2016-04-14 15:37:51 +02:00
Wolfgang Seeker
d99a9cbce9
different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
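The three attachment rules in commit d99a9cbce9 can be written out as a small standalone function. This is a sketch of the rules as stated in the message, not the parser's actual code: `attach_space_tokens` is a hypothetical name, the input is assumed to be a list of booleans marking which tokens are whitespace, and representing the head of the last token in an all-space input as its own index is an assumption of this sketch.

```python
def attach_space_tokens(is_space):
    """Return a head index for each space token, None for other tokens."""
    n = len(is_space)
    heads = [None] * n
    non_space = [i for i, flag in enumerate(is_space) if not flag]
    if not non_space:
        # Input is all whitespace: the last space token heads all others.
        # Assumption: the last token points at itself to mark it as root.
        return [n - 1] * n
    first_non_space = non_space[0]
    prev_non_space = None
    for i, flag in enumerate(is_space):
        if not flag:
            prev_non_space = i
        elif prev_non_space is None:
            # Leading space: attach to the first following non-space token.
            heads[i] = first_non_space
        else:
            # Default: attach to the previous non-space token.
            heads[i] = prev_non_space
    return heads
```

For the input `[True, False, True, True, False]` the leading space attaches forward to token 1, and the two inner spaces attach backward to token 1 as well.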
Wolfgang Seeker
d328e0b4a8
Merge branch 'master' into space_head_bug
2016-04-11 12:11:01 +02:00
Wolfgang Seeker
80bea62842
bugfix in unit test
2016-04-08 16:46:44 +02:00
Wolfgang Seeker
1fe911cdb0
bugfix
2016-04-07 18:19:51 +02:00
Matthew Honnibal
872695759d
Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Wolfgang Seeker
7195b6742d
add restrictions to L-arc and R-arc to prevent space heads
2016-03-28 10:40:52 +02:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
46e3f979f1
add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker
7adbd7a785
replace Counter with normal dict
2016-03-03 21:36:27 +01:00
Wolfgang Seeker
1ae487a4f6
add backwards compatibility with python 2.6
2016-03-03 21:18:12 +01:00
Wolfgang Seeker
72b8df0684
turned PseudoProjectivity into a normal python class
2016-03-03 19:05:08 +01:00
Wolfgang Seeker
690c5acabf
adjust train.py to train both english and german models
2016-03-03 15:21:00 +01:00
Wolfgang Seeker
3448cb40a4
integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
all functionality to implement Nivre & Nilsson 2005's pseudo-projective
parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker
56b7210e82
moved nonproj.py to syntax/nonproj.pyx
2016-02-25 15:08:49 +01:00
Matthew Honnibal
1b83cb9dfa
* Fix Issue #251 : Incorrect right edge calculation on left-clobber low in the tree
2016-02-07 00:00:42 +01:00
Matthew Honnibal
4412a70dc5
* Initialize StateC._empty_token to 0, to avoid undefined behaviour.
2016-02-06 13:34:38 +01:00
Matthew Honnibal
1b41f868d2
* Check for errors in parser, and parallelise the left-over batch
2016-02-06 10:06:30 +01:00
Matthew Honnibal
165ca28b80
* Set is_parsed flag in Parser.pipe
2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a
* Set is_parsed flag in Parser.pipe
2016-02-05 19:50:11 +01:00
Matthew Honnibal
b04c9aad71
* Fix off-by-one in Parser.pipe
2016-02-05 19:37:50 +01:00
Matthew Honnibal
048dfe35aa
* cimport cython.parallel
2016-02-05 12:20:42 +01:00
Matthew Honnibal
8a13cebdcc
* Update for modified thinc interface
2016-02-05 11:44:39 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
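The buffering idea behind the .pipe method in commit 84b247ef83 can be illustrated with a plain generator. This is a minimal sketch under stated assumptions, not the Cython implementation: `pipe_sketch`, `process_batch` and `batch_size` are hypothetical names, and the batch is processed sequentially here where the real method may hand it to multiple threads.

```python
from itertools import islice


def pipe_sketch(stream, process_batch, batch_size=1000):
    """Consume a stream lazily, process it in buffered batches, yield results."""
    iterator = iter(stream)
    while True:
        # Buffer up to batch_size items; a batch is the unit that could be
        # handed to worker threads in a multi-threaded version.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for result in process_batch(batch):
            yield result
```

Output order follows input order, and the final, shorter batch is still processed, which is the "left-over batch" case that later commits in this log address.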
Matthew Honnibal
e3db39dd21
* Fix compiler warning about signed/unsigned comparison
2016-02-01 09:08:07 +01:00
Matthew Honnibal
b3802562d6
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd
* Fix merge conflict
2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8
* Fix parseC function
2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40
* Add a parse_batch method to Parser, that releases the GIL around a batch of documents.
2016-02-01 08:34:55 +01:00
Matthew Honnibal
d5579cd0d8
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 03:08:49 +01:00
Matthew Honnibal
490ba65398
* Use openmp in parser
2016-02-01 03:08:42 +01:00
Matthew Honnibal
cb78d91ec5
* Fix ArcEager.set_valid
2016-02-01 03:07:37 +01:00
Matthew Honnibal
28e5ad62bc
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448
* Now fully proxied
2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1
* Continue proxying. Some problem currently
2016-02-01 02:22:21 +01:00
Matthew Honnibal
2169bbb7ea
* Shadow StateClass with StateC, to start proxying
2016-02-01 01:16:14 +01:00
Matthew Honnibal
2fa228458e
* Add _state file, which StateClass will proxy to
2016-02-01 01:09:21 +01:00
Matthew Honnibal
9410e74c92
* Switch parser to use nogil functions
2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791
* Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser
2016-01-30 14:31:36 +01:00
Matthew Honnibal
84c5dfbfc3
* Clean up debugging python list
2016-01-19 20:10:32 +01:00
Matthew Honnibal
04d0686b26
* Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions.
2016-01-19 20:10:04 +01:00
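Idempotent registration, as described in commit 04d0686b26, means that adding the same action twice leaves the system unchanged. A minimal sketch, assuming an action is identified by a (move, label) pair; `ToyTransitionSystem` and its attributes are hypothetical and do not mirror the real TransitionSystem class:

```python
class ToyTransitionSystem:
    """Hypothetical container that ignores duplicate actions."""

    def __init__(self):
        self.actions = []   # registration order is preserved
        self._seen = set()  # fast duplicate check

    def add_action(self, move, label):
        """Register (move, label); return False if it was already present."""
        key = (move, label)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.actions.append(key)
        return True
```

Keeping the list alongside the set preserves a stable index for each action, which matters if the index doubles as a class ID in a model.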
Matthew Honnibal
65c5bc4988
* Add add_label method, to allow users to register new entity types and dependency labels.
2016-01-19 19:11:02 +01:00
Matthew Honnibal
151aa0b0e2
* Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model
2016-01-19 19:09:33 +01:00
Matthew Honnibal
c8e0011ebc
* Add iterators to the NER and parser transition systems, to get the action types
2016-01-19 19:07:43 +01:00
Matthew Honnibal
04177debd0
* Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184, but may cause further problems. Needs testing.
2016-01-19 02:54:15 +01:00
Matthew Honnibal
3dc398b727
* Fix merge conflict in requirements.txt
2016-01-16 16:20:49 +01:00
Matthew Honnibal
c025a0c64b
* Check for KeyboardInterrupt in parser.__call__
2016-01-16 16:18:44 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
5623242b3e
* Adjust NER rules, so that U entries in gazetteer don't become B moves to the model
2015-11-12 04:48:23 +11:00
Matthew Honnibal
44fbdc7260
* Fix bug in NER transition system, that sometimes left no valid moves
2015-11-08 16:19:12 +01:00
Matthew Honnibal
e92371bb54
* Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed.
2015-11-08 22:17:51 +11:00
Matthew Honnibal
6f47074214
* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for easy pickling.
2015-11-07 18:25:17 +11:00
Matthew Honnibal
1cfa20fb17
* Fix sentence-final whitespace issue
2015-11-07 17:34:46 +11:00
Matthew Honnibal
888c05a7fa
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:30:45 +11:00
Matthew Honnibal
af70dc166a
* Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect.
2015-11-07 09:52:00 +11:00
Matthew Honnibal
a06e3c8963
* Fix bone-headed mistake in StateClass.E
2015-11-07 07:35:28 +11:00
Matthew Honnibal
d24b8509e4
* Correct screw ups from the previous commits
2015-11-07 06:51:41 +11:00
Matthew Honnibal
5efad178b5
* Set ent tag when close entity
2015-11-07 06:09:25 +11:00
Matthew Honnibal
9285f01d26
* Fix broken StateClass.E tracking
2015-11-07 06:06:39 +11:00
Matthew Honnibal
19136b0e7d
* Add better debug message for illegal move
2015-11-07 05:34:37 +11:00
Matthew Honnibal
2733816b7b
* Fix whitespace
2015-11-07 05:31:06 +11:00
Matthew Honnibal
01ab464383
* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169
2015-11-07 05:30:44 +11:00
Matthew Honnibal
b65633f270
* Fix function that returns nth entity in StateClass. Was only returning the first.
2015-11-07 05:29:11 +11:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
b9991fbd20
* Update to use thinc 3.0
2015-11-06 00:25:59 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
329ae57520
* Fix whitespace attachment thing
2015-10-13 09:46:38 +02:00
Matthew Honnibal
37919eac82
* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.
2015-10-13 18:23:24 +11:00
Matthew Honnibal
c70eb776ae
* Fix whitespace attachment, so that left/right children are consistent with head.
2015-10-13 15:58:22 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
9dd2f25c74
* Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units.
2015-10-10 15:53:30 +11:00
Matthew Honnibal
8b39feefbe
* Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries
2015-10-10 15:32:13 +11:00
Matthew Honnibal
0e24d099a1
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edges are updated in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
2015-09-09 03:40:44 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
5edac11225
* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.
2015-09-06 04:15:00 +02:00
Matthew Honnibal
a3d5e6c0dd
* Reform constructor and save/load workflow in parser model
2015-08-26 19:19:01 +02:00
Matthew Honnibal
bf38b3b883
* Hack on l/r reversal bug
2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47
* Fix label prediction in StepwiseState
2015-08-10 05:05:31 +02:00
Matthew Honnibal
2c9753eff2
* Whitespace
2015-08-10 00:09:02 +02:00
Matthew Honnibal
9de98f5a6f
* Add Parser.stepthrough method, with context manager
2015-08-10 00:08:46 +02:00
Matthew Honnibal
fe43f8cf39
* Whitespace
2015-08-09 02:31:53 +02:00
Matthew Honnibal
9c090945e0
* Add Parser.predict method, and clean up Parser.get_state
2015-08-09 02:29:58 +02:00
Matthew Honnibal
04fccfb984
* Fix get_state for parser prediction
2015-08-09 02:11:22 +02:00
Matthew Honnibal
55fde0e240
* Fix get_state
2015-08-09 01:45:30 +02:00
Matthew Honnibal
f0f4fa9838
* Fix Parser.get_state
2015-08-09 01:40:13 +02:00
Matthew Honnibal
18331dca89
* Add continue_for argument to parser 'partial' function, which is now renamed to get_state
2015-08-09 01:31:54 +02:00
Matthew Honnibal
0653288fa5
* Fix stateclass.queue
2015-08-09 00:39:02 +02:00
Matthew Honnibal
9de218b7ba
* Fix Parser.partial function
2015-08-08 23:45:18 +02:00
Matthew Honnibal
cc9deae960
* Add is_valid method to transition_system
2015-08-08 23:36:18 +02:00
Matthew Honnibal
2a46c77324
* Whitespace
2015-08-08 23:35:59 +02:00
Matthew Honnibal
7bafc789e7
* Add stack and queue properties to stateclass, for python access
2015-08-08 23:32:42 +02:00
Matthew Honnibal
3af938365f
* Add function partial to Parser
2015-08-08 23:32:15 +02:00
Matthew Honnibal
76a1f0481a
* Whitespace
2015-08-08 23:31:54 +02:00
Matthew Honnibal
59c3bf60a6
* Ensure entity recognizer doesn't over-write preset types
2015-08-06 16:09:08 +02:00
Matthew Honnibal
9c1724ecae
* Gazetteer stuff working, now need to wire up to API
2015-08-06 00:35:40 +02:00
Matthew Honnibal
a8bbd7312c
* Hackishly patch long dependencies problem
2015-07-28 00:14:29 +02:00
Matthew Honnibal
bb583f7f09
* Hackishly patch long dependencies problem
2015-07-27 23:14:33 +02:00
Matthew Honnibal
823ef4a00b
* Remove profile declarations
2015-07-25 18:13:06 +02:00
Matthew Honnibal
aa28e2e01d
* Release the GIL around parse function
2015-07-24 04:53:27 +02:00
Matthew Honnibal
d5255aad77
* Update freqs for missing tags in ner, for serializer
2015-07-23 01:17:11 +02:00
Matthew Honnibal
12699a1152
* Set initial freqs, to avoid missing values in serializer
2015-07-23 01:16:27 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Matthew Honnibal
b1d74ce60d
* Remove unused joint.pyx and joint.pxd files
2015-07-17 23:31:44 +02:00
Matthew Honnibal
fb0a641a2d
* Don't release the gil around Parser.parse. Does this indicate thread problems?
2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f
* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe.
2015-07-17 22:37:24 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00
Matthew Honnibal
9a8db9743c
* Remove gil from parser.call
2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
55f1042443
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579
* Fix edge features
2015-07-09 12:15:01 +02:00
Matthew Honnibal
adb868bdad
* Add warning for models not found in parser
2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb
* Add warning for models not found in parser
2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6
* Add warning for models not found in parser
2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389
* Add warning for models not found in parser
2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8
* Add warning for models not found in parser
2015-07-08 19:52:30 +02:00
Matthew Honnibal
0ceb1f71c2
* Update parse features
2015-07-08 19:11:36 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
ff885e8511
* Add ParserFactory convenience function
2015-07-08 12:35:46 +02:00
Matthew Honnibal
52fd80c6c6
* Add experimental supersense features for parsing, based on lookup into wordnet.
2015-07-01 20:12:44 +02:00
Matthew Honnibal
e20106fdff
* Begin reorganizing neuralnet work
2015-06-30 14:26:32 +02:00
Matthew Honnibal
3bb5876c5a
* Inline methods in StateClass
2015-06-29 01:10:14 +02:00
Matthew Honnibal
313a7f87b3
* Inline methods in StateClass
2015-06-29 01:06:28 +02:00
Matthew Honnibal
a02fd3af5d
* Check valency in L and R feature methods, to make feature calculation faster
2015-06-29 00:27:56 +02:00
Matthew Honnibal
5d870720bc
* Check valency in L and R feature methods, to make feature calculation faster
2015-06-29 00:17:29 +02:00
Matthew Honnibal
f4986d5d3c
* Use new Example class
2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f
* Fix neural net stuff
2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3
* Remove hard-coding of vector lengths
2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b
* Merge changes, and adjust Example to use memoryview
2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c
* Prepare for new models to be plugged in by using Example class
2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bbef71f213
* Fix min function in fill_context
2015-06-28 10:46:39 +02:00
Matthew Honnibal
142b6f9510
* Revert last changes
2015-06-28 10:44:28 +02:00
Matthew Honnibal
b06962f18b
* Pad buffers in state
2015-06-28 10:36:14 +02:00
Matthew Honnibal
53be72387c
* Hack at fill_context to investigate performance loss
2015-06-28 10:34:28 +02:00
Matthew Honnibal
71a4e876a9
* Fix parse features
2015-06-28 09:27:33 +02:00
Matthew Honnibal
5af500909c
* Remove unused directive from parser.pyx
2015-06-28 06:20:21 +02:00
Matthew Honnibal
d5b4090705
* Add profile directive
2015-06-28 06:19:33 +02:00
Matthew Honnibal
2b5421e60c
* Add profile directive
2015-06-28 06:07:04 +02:00
Matthew Honnibal
8b5de4a411
* Add word / tag / label sets, for use in neural net
2015-06-28 05:46:53 +02:00
Matthew Honnibal
ed40a8380e
* Remove hard-coding of vector lengths
2015-06-27 04:18:47 +02:00
Matthew Honnibal
ebe630cc8d
* Enable more features for NN
2015-06-27 04:17:29 +02:00
Matthew Honnibal
f8bb43475e
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a
* Prepare for new models to be plugged in by using Example class
2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-26 06:25:36 +02:00
Matthew Honnibal
02b171ee67
* Bug fixes to edge calculation
2015-06-24 04:28:02 +02:00
Matthew Honnibal
7f9384f53c
* Remove deprecated _state module
2015-06-23 17:28:24 +02:00
Matthew Honnibal
6dbe182491
* Fix merge conflicts
2015-06-23 17:28:00 +02:00
Matthew Honnibal
579735a095
* Remove import of _state module
2015-06-23 17:25:08 +02:00
Matthew Honnibal
88f55d136b
* Remove deprecated _state module
2015-06-23 17:19:51 +02:00
Matthew Honnibal
9ab9dd2bf7
* Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments
2015-06-23 17:17:33 +02:00
Matthew Honnibal
7ebfe4b983
* Fixes to edge features
2015-06-23 16:32:54 +02:00
Matthew Honnibal
7b125f5a86
* Fixes to edge features
2015-06-23 16:31:01 +02:00
Matthew Honnibal
35c290bee4
* Fix edge features
2015-06-23 15:50:56 +02:00
Matthew Honnibal
221e2e485f
* Assign 'ROOT' as label, not 'root'
2015-06-23 15:09:54 +02:00
Matthew Honnibal
a7bf7b0626
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
2015-06-23 05:39:43 +02:00
Matthew Honnibal
ee3e56f27b
* Fix bounds checking on entities
2015-06-23 04:35:08 +02:00
Matthew Honnibal
43ef5ddea5
* Ensure root label is spelled ROOT, for backwards compatibility
2015-06-23 04:14:03 +02:00
Matthew Honnibal
065c2e1d2d
* Add some bounds checking around state arrays
2015-06-23 04:13:09 +02:00
Matthew Honnibal
f01b3d043e
* Add padding to arrays in stateclass. May be papering over a deeper bug.
2015-06-23 03:03:41 +02:00
Matthew Honnibal
69507bc729
* Re-enable Break transition in arc_eager.pyx
2015-06-23 00:03:30 +02:00
Matthew Honnibal
ab110be125
* Remove debugging in parser.pyx
2015-06-16 23:37:25 +02:00
Matthew Honnibal
9b13d11ab3
* Fix handling of entities in StateClass
2015-06-16 23:35:21 +02:00
Matthew Honnibal
c40a2c661c
* Add tree_arc_eager
2015-06-15 08:23:24 +02:00
Matthew Honnibal
5da5cf7084
* Add some more features for S1/S0
2015-06-15 04:07:13 +02:00
Matthew Honnibal
8156a01bca
* Fix root label for orig_arc_eager
2015-06-15 02:54:55 +02:00
Matthew Honnibal
21930ede15
* Switch toggle on USE_ROOT_ARC_SEGMENT
2015-06-15 02:54:32 +02:00
Matthew Honnibal
38a6afa484
* Make possibly dubious correction to the unshift oracle
2015-06-15 02:50:00 +02:00
Matthew Honnibal
f66228f253
* Add some more features, esp for labels
2015-06-14 21:18:02 +02:00
Matthew Honnibal
3da8e0f317
* Add orig_arc_eager
2015-06-14 20:31:44 +02:00
Matthew Honnibal
ea8a103007
* Fix import of TransitionSystem in parser.pyx
2015-06-14 19:01:26 +02:00
Matthew Honnibal
e0984ca139
* Fix valency features in StateClass
2015-06-14 17:50:26 +02:00
Matthew Honnibal
763cbd23d5
* Update stateclass.print_state
2015-06-14 17:44:29 +02:00
Matthew Honnibal
bdd07bf000
* Fix Break oracle, but disable the Break transition for now, while we finalize the gold-standard experiments
2015-06-14 17:44:03 +02:00
Matthew Honnibal
399f15fbdf
* Add flag to toggle handling of multi-root inputs without the Break transition. Clear up now unused best_valid stuff.
2015-06-14 00:28:37 +02:00
Matthew Honnibal
75289b4761
* Don't refuse to parse single token sentences, in case some transition system needs them, e.g. single word entity. Instead fix error in _init_state.
2015-06-13 22:55:55 +02:00
Matthew Honnibal
77d7e79c7e
* Fix r/l and distance features.
2015-06-12 13:06:15 +02:00
Matthew Honnibal
15e177d7a1
* Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc
2015-06-12 01:50:23 +02:00
Matthew Honnibal
afd77a529b
* Prepare for break transition, with fast-forwarding. 86.5 on 1k nw gold preproc
2015-06-10 14:08:30 +02:00
Matthew Honnibal
495f528709
* Add support for sentence breaks in stateclass
2015-06-10 12:34:28 +02:00
Matthew Honnibal
b7b18c279d
* Fix Reduce oracle. Getting 86.35
2015-06-10 11:33:39 +02:00
Matthew Honnibal
bb09b5d91a
* Fix shifted bit vector in stateclass --- should reflect whether the word has been *unshifted*.
2015-06-10 11:33:09 +02:00
Matthew Honnibal
aa9625f688
* Do non-monotonic Unshift. Every word can be shifted at most 1 time. When the Reduce move is used, if S0 has no head, we put the word back on the buffer. Gets 86.4 on nw 1k with gold pre-proc. Break transition not yet implemented for this.
2015-06-10 10:15:56 +02:00
Matthew Honnibal
7bf6b7de3e
* Add unshift action to StateClass, and track which moves have been shifted
2015-06-10 10:13:03 +02:00
Matthew Honnibal
f7c8069e65
* Fix bug in distance feature
2015-06-10 10:12:17 +02:00
Matthew Honnibal
abd07c067a
* Inline B and S methods on stateclass
2015-06-10 07:22:33 +02:00
Matthew Honnibal
e2f9a80713
* Remove old _state imports
2015-06-10 07:09:17 +02:00
Matthew Honnibal
e9aaecc619
* Remove from_struct method from StateClass
2015-06-10 06:58:27 +02:00
Matthew Honnibal
18cc326dc0
* Bug fixes to ner.pyx
2015-06-10 06:57:41 +02:00
Matthew Honnibal
e5570c9700
* Set nogil for oracle functions
2015-06-10 06:56:56 +02:00
Matthew Honnibal
4575e7a60f
* Fix beam search with new StateClass
2015-06-10 06:33:39 +02:00
Matthew Honnibal
04b1cd9b8c
* Greedy parsing working with new StateClass. Beam parsing broken
2015-06-10 04:20:23 +02:00
Matthew Honnibal
6a94b64eca
* Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested.
2015-06-10 02:03:38 +02:00
Matthew Honnibal
f14a1526aa
* Remove version of fill_context that takes State*
2015-06-10 01:39:07 +02:00
Matthew Honnibal
d68c686ec1
* Move StateClass into interface of transition functions
2015-06-10 01:35:28 +02:00
Matthew Honnibal
4b98b3e9c8
* Cost functions now take StateClass argument, instead of State*.
2015-06-10 00:40:43 +02:00
Matthew Honnibal
e0cf61f591
* Move StateClass into the interface for is_valid
2015-06-09 23:23:28 +02:00
Matthew Honnibal
0895d454fb
* Prepare to switch to using state class, instead of state struct
2015-06-09 21:20:14 +02:00
Matthew Honnibal
2b9629ed62
* Begin adding stateclass to ArcEager
2015-06-09 01:41:09 +02:00
Matthew Honnibal
ba10fd8af5
* Add StateClass, to replace/refactor the mess in _state
2015-06-09 01:39:54 +02:00
Matthew Honnibal
c7e3dfc1dc
* Don't automatically push words when stack is empty, as it messes up beam parsing. Add hash method to beam state.
2015-06-08 14:49:04 +02:00
Matthew Honnibal
6e2564239d
* Bug fixes to beam parser. Search still broken on non-gold sentences
2015-06-07 19:12:59 +02:00
Matthew Honnibal
731e5f1e46
* Add get() function in spacy/syntax/Config
2015-06-07 19:09:15 +02:00
Matthew Honnibal
8f142c1838
* Refactor transition system oracles, to split out move and label cost. Preparing to add Unshift move. Will exclude non-monotonic.
2015-06-07 03:21:29 +02:00
Matthew Honnibal
1fee7ade61
* Tweak to ner
2015-06-05 23:48:43 +02:00
Matthew Honnibal
33e70b167f
* Remove dead code from ner.pyx
2015-06-05 17:12:47 +02:00
Matthew Honnibal
88ac5c6e98
* Send beam_width < 0 to greedy parser
2015-06-05 17:12:06 +02:00
Matthew Honnibal
0114e7600d
* Fix NER oracle
2015-06-05 17:11:26 +02:00
Matthew Honnibal
6bf35cecc3
* Refactor transition system to use classes with staticmethods.
2015-06-05 02:27:17 +02:00
Matthew Honnibal
36a34d544b
* Refactoring arc_eager, grouping oracle functions into transitions
2015-06-04 22:43:03 +02:00
Matthew Honnibal
4433396005
* Improve efficiency of dynamic oracle, making beam training faster
2015-06-04 21:15:14 +02:00
Matthew Honnibal
079dad28a7
* Update for faster beam training
2015-06-04 19:32:32 +02:00
Matthew Honnibal
a2627b6102
* Fix bug in refactored init_transition
2015-06-03 06:01:26 +02:00
Matthew Honnibal
dd0867645d
* Remove stray const from State header
2015-06-03 00:10:04 +02:00
Matthew Honnibal
6c47b10a6e
* Make optimization to children_in_buffer: stop searching when we would cross a bracket.
2015-06-02 21:05:24 +02:00
Matthew Honnibal
a513ec500f
* Have oracle functions take a struct instead of a Python object
2015-06-02 20:01:06 +02:00
Matthew Honnibal
d1b55310a1
* Refactor _advance_beam function
2015-06-02 18:38:41 +02:00
Matthew Honnibal
0786d9b3c7
* Refactor TransitionSystem, adding set_valid method
2015-06-02 18:38:07 +02:00
Matthew Honnibal
a3964957f6
* Add profiling for _state.pyx
2015-06-02 18:36:27 +02:00
Matthew Honnibal
e822df0867
* Fix bugs in new greedy/beam parser
2015-06-02 02:01:33 +02:00
Matthew Honnibal
66dfa95847
* Revise greedy_parse/beam_parse ownership goof
2015-06-02 01:34:19 +02:00
Matthew Honnibal
75658b2ed3
* Remove use of new beam.loss property, to maintain compatibility with older versions of thinc for now.
2015-06-02 00:57:09 +02:00
Matthew Honnibal
7c29362d60
* Rename parser class in parser.pxd, now that beam parsing is supported
2015-06-02 00:53:49 +02:00
Matthew Honnibal
58d5ac0944
* Add beam search capabilities to Parser. Rename GreedyParser to Parser.
2015-06-02 00:28:02 +02:00
Matthew Honnibal
e09a08bd00
* Add copy_state function
2015-06-01 23:06:30 +02:00
Matthew Honnibal
c7876aa8b6
* Add get_valid method
2015-06-01 23:06:00 +02:00
Matthew Honnibal
5e99ff94c8
* Edits to arc eager oracle. Couldn't figure out how the non-monotonic lines made sense. They seem covered by children_in_stack
2015-05-31 15:14:37 +02:00
Matthew Honnibal
6c5632b71c
* Roll back proposed change to Break transition while investigating its effect
2015-05-31 06:49:52 +02:00
Matthew Honnibal
e77940565d
* Add length cap to distance feature
2015-05-31 05:25:30 +02:00
Matthew Honnibal
fd596351ba
* Fix valency features
2015-05-31 05:24:33 +02:00
Matthew Honnibal
76300bbb1b
* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.
2015-05-30 01:25:46 +02:00
Matthew Honnibal
8f31d3b864
* Relax constraint on Break transition for non-monotonic parsing.
2015-05-28 23:39:52 +02:00
Matthew Honnibal
4010b9b6d9
* Pass parameter for regularization in parser.pyx
2015-05-27 03:18:50 +02:00
Matthew Honnibal
fc75210941
* Move spacy.syntax.conll to spacy.gold
2015-05-24 21:35:02 +02:00
Matthew Honnibal
efe7a7d7d6
* Clean unused functions from spacy.syntax.conll
2015-05-24 20:06:46 +02:00
Matthew Honnibal
78487f3e66
* Update parser oracle for missing heads
2015-05-24 20:05:58 +02:00
Matthew Honnibal
acd1245ad4
* Remove cruft from conll.pyx --- unused stuff about evaluation, which now lives in spacy.scorer
2015-05-24 17:35:49 +02:00
Matthew Honnibal
20f1d868a3
* Tmp commit. Working on whole-document parsing
2015-05-24 02:49:56 +02:00
Matthew Honnibal
f2ee9c4feb
* Comment out constituency parsing stuff, so that code compiles
2015-05-20 16:55:05 +02:00
Matthew Honnibal
9dfc9c039c
* Work on constituency parsing.
2015-05-20 16:02:51 +02:00
Matthew Honnibal
ba07b925a7
* Fix compile error in conll.pyx
2015-05-12 22:33:47 +02:00
Matthew Honnibal
f1e0272b18
* Disable c-parsing transitions
2015-05-12 22:33:25 +02:00
Matthew Honnibal
03a6626545
* Tmp commit
2015-05-12 20:27:56 +02:00
Matthew Honnibal
9568ebed08
* Fix off-by-one in head reading
2015-05-12 20:27:56 +02:00
Matthew Honnibal
d2ac8d8007
* Add ctnt field to State, in preparation for constituency parsing
2015-05-12 20:27:56 +02:00
Matthew Honnibal
ab67693393
* Add read_json_file to conll.pyx
2015-05-12 20:27:55 +02:00
Matthew Honnibal
aff9359a8d
* Update ner.pyx to expect brackets from gold_tuples
2015-05-12 20:27:55 +02:00