Commit Graph

80 Commits

Author SHA1 Message Date
Matthew Honnibal
bede11b67c
Improve label management in parser and NER (#2108)
This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly.

Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important: it ensures that the label-to-class mapping is stable.

We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense.

To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort.
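
As a rough illustration of the negative-frequency trick (a sketch only, not the code from this patch): the helper names `add_label` and `ordered_labels` are hypothetical, and sorting in descending order is an assumption made here so that counted labels come first and new labels follow in the order they were added.

```python
def add_label(freqs, label):
    """Give a new label the next unused negative pseudo-frequency (-1, -2, ...)."""
    if label in freqs:
        return
    # Lowest pseudo-frequency handed out so far; 0 if no new labels exist yet.
    lowest = min(min(freqs.values(), default=0), 0)
    freqs[label] = lowest - 1


def ordered_labels(freqs):
    """Sort by (frequency, label), descending (assumed direction, see lead-in)."""
    items = sorted(freqs.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return [label for label, _ in items]


# Hypothetical label counts for a single action.
freqs = {"nsubj": 120, "dobj": 80}
add_label(freqs, "zeta")   # stored with "frequency" -1
add_label(freqs, "alpha")  # stored with "frequency" -2
before = ordered_labels(freqs)

add_label(freqs, "mid")    # stored with "frequency" -3
after = ordered_labels(freqs)
```

Because each new label gets a value lower than every earlier one, adding a label only ever appends to the ordering, so the class IDs already assigned do not move.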

Storing frequencies also enables the next improvement. Previously we had to iterate over the whole training set to pre-process it for the deprojectivisation. This meant holding the whole training set in memory, which accounted for most of the memory required during training.

To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make.
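
The idea can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the parser's actual code: `count_labels`, `select_classes`, the `(action, label)` pair format and the `min_freq` parameter are all hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict


def count_labels(stream):
    """Consume (action, label) pairs one at a time, keeping only the counts."""
    freqs = defaultdict(Counter)
    for action, label in stream:
        freqs[action][label] += 1
    return freqs


def select_classes(freqs, min_freq):
    """Apply the frequency cut-off once all counts are known."""
    classes = []
    for action in sorted(freqs):
        items = sorted(freqs[action].items(),
                       key=lambda kv: (kv[1], kv[0]), reverse=True)
        classes.extend((action, label) for label, freq in items
                       if freq >= min_freq)
    return classes


def example_stream():
    # Stand-in for annotations read lazily from the training data.
    yield from [("LEFT", "nsubj"), ("RIGHT", "dobj"),
                ("LEFT", "nsubj"), ("LEFT", "rare")]


freqs = count_labels(example_stream())
frequent = select_classes(freqs, min_freq=2)
everything = select_classes(freqs, min_freq=1)
```

Only the nested count dictionary survives the pass over the data, so memory use depends on the number of distinct labels rather than on the size of the training set.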

Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths.
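
A minimal sketch of shuffling by path, assuming one file per document. To keep the example runnable with the standard library alone, it writes JSON files instead of msgpack; `write_docs` and `iter_shuffled` are hypothetical helpers, not GoldCorpus methods.

```python
import json
import random
import tempfile
from pathlib import Path


def write_docs(docs, directory):
    """Write each document to its own file and return the list of paths."""
    paths = []
    for i, doc in enumerate(docs):
        path = Path(directory) / f"{i}.json"
        path.write_text(json.dumps(doc), encoding="utf8")
        paths.append(path)
    return paths


def iter_shuffled(paths, seed=None):
    """Shuffle the paths, then load one document at a time."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    for path in paths:
        yield json.loads(path.read_text(encoding="utf8"))


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder documents; real ones would hold the gold annotations.
    paths = write_docs([{"id": 0}, {"id": 1}, {"id": 2}], tmp)
    epoch = list(iter_shuffled(paths, seed=0))
```

Only the list of paths is held and shuffled in memory; each document is read from disk when the iterator reaches it.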

This is a squash merge, as I made a lot of very small commits. Individual commit messages below.

* Simplify label management for TransitionSystem and its subclasses

* Fix serialization for new label handling format in parser

* Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir

* Set actions in transition system

* Require thinc 6.11.1.dev4

* Fix error in parser init

* Add unicode declaration

* Fix unicode declaration

* Update textcat test

* Try to get model training on less memory

* Print json loc for now

* Try rapidjson to reduce memory use

* Remove rapidjson requirement

* Try rapidjson for reduced mem usage

* Handle None heads when projectivising

* Stream json docs

* Fix train script

* Handle projectivity in GoldParse

* Fix projectivity handling

* Add minibatch_by_words util from ud_train

* Minibatch by number of words in spacy.cli.train

* Move minibatch_by_words util to spacy.util

* Fix label handling

* More hacking at label management in parser

* Fix encoding in msgpack serialization in GoldParse

* Adjust batch sizes in parser training

* Fix minibatch_by_words

* Add merge_subtokens function to pipeline.pyx

* Register merge_subtokens factory

* Restore use of msgpack tmp directory

* Use minibatch-by-words in train

* Handle retokenization in scorer

* Change back-off approach for missing labels. Use 'dep' label

* Update NER for new label management

* Set NER tags for over-segmented words

* Fix label alignment in gold

* Fix label back-off for infrequent labels

* Fix int type in labels dict key

* Fix int type in labels dict key

* Update feature definition for 8 feature set

* Update ud-train script for new label stuff

* Fix json streamer

* Print the line number if conll eval fails

* Update children and sentence boundaries after deprojectivisation

* Export set_children_from_heads from doc.pxd

* Render parses during UD training

* Remove print statement

* Require thinc 6.11.1.dev6. Try adding wheel as install_requires

* Set different dev version, to flush pip cache

* Update thinc version

* Update GoldCorpus docs

* Remove print statements

* Fix formatting and links [ci skip]
2018-03-19 02:58:08 +01:00
Matthew Honnibal
e361b4f82b Fix #1929: Incorrect NER when pre-set sentence boundaries. 2018-02-08 15:25:41 +01:00
Matthew Honnibal
2512ea9eeb Fix memory leak in beam parser 2017-11-14 02:11:40 +01:00
ines
b4d226a3f1 Tidy up syntax 2017-10-27 19:45:57 +02:00
Matthew Honnibal
92c5d78b42 Unhack NER.add_action 2017-10-07 19:02:40 +02:00
Matthew Honnibal
c003c561c3 Revert NER action loading change, for model compatibility 2017-09-17 05:46:03 -05:00
Matthew Honnibal
8c503487af Fix lookup of missing NER actions 2017-09-14 16:59:45 +02:00
Matthew Honnibal
daf869ab3b Fix add_action for NER, so labelled 'O' actions aren't added 2017-09-14 16:16:41 +02:00
Matthew Honnibal
84b7ed49e4 Ensure updates aren't made if no gold available 2017-08-20 14:41:38 +02:00
Matthew Honnibal
27abc56e98 Add method to get beam entities 2017-07-29 21:59:02 +02:00
Matthew Honnibal
3da1063b36 Add beam decoding to parser, to allow NER uncertainties 2017-07-20 15:02:55 +02:00
Matthew Honnibal
0ca5832427 Improve negative example handling in NER oracle 2017-07-20 00:18:49 +02:00
Matthew Honnibal
7996d21717 Fixes for new StringStore 2017-05-28 11:09:27 -05:00
Matthew Honnibal
84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal
99316fa631 Use ordered dict to specify actions 2017-05-27 15:50:21 -05:00
Matthew Honnibal
3d5a536eaa Improve efficiency of parser batching 2017-05-26 11:31:23 -05:00
Matthew Honnibal
e2136232f9 Exclude states with no matching gold annotations from parsing 2017-05-22 10:30:12 -05:00
Matthew Honnibal
8b04b0af9f Remove freqs from transition_system 2017-05-20 02:20:48 -05:00
ines
0739ae7b76 Tidy up and fix formatting and imports 2017-04-15 13:05:15 +02:00
Matthew Honnibal
354458484c WIP on add_label bug during NER training
Currently when a new label is introduced to NER during training,
it causes the labels to be read in in an unexpected order. This
invalidates the model.
2017-04-14 23:52:17 +02:00
Matthew Honnibal
2611ac2a89 Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens 2017-03-16 09:38:28 -05:00
Matthew Honnibal
931feb3360 Allow beam parsing for NER 2017-03-11 11:12:01 -06:00
Matthew Honnibal
159e8c46e1 Merge old training fixes with newer state 2016-11-25 09:16:36 -06:00
Matthew Honnibal
39341598bb Fix NER label calculation 2016-11-25 09:02:22 -06:00
Matthew Honnibal
301f3cc898 Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found. 2016-10-27 18:01:55 +02:00
Matthew Honnibal
f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal
9e09b39b9f Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
0442e0ab1e Changes to transition systems for new StringStore scheme 2016-09-30 19:58:51 +02:00
Matthew Honnibal
a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00
Matthew Honnibal
10877a7791 * Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser 2016-01-30 14:31:36 +01:00
Matthew Honnibal
c8e0011ebc * Add iterators to the NER and parser transition systems, to get the action types 2016-01-19 19:07:43 +01:00
Matthew Honnibal
5623242b3e * Adjust NER rules, so that U entries in gazetteer don't become B moves to the model 2015-11-12 04:48:23 +11:00
Matthew Honnibal
44fbdc7260 * Fix bug in NER transition system, that sometimes left no valid moves 2015-11-08 16:19:12 +01:00
Matthew Honnibal
e92371bb54 * Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed. 2015-11-08 22:17:51 +11:00
Matthew Honnibal
af70dc166a * Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect. 2015-11-07 09:52:00 +11:00
Matthew Honnibal
d24b8509e4 * Correct screw ups from the previous commits 2015-11-07 06:51:41 +11:00
Matthew Honnibal
5efad178b5 * Set ent tag when close entity 2015-11-07 06:09:25 +11:00
Matthew Honnibal
01ab464383 * Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169 2015-11-07 05:30:44 +11:00
Matthew Honnibal
fe43f8cf39 * Whitespace 2015-08-09 02:31:53 +02:00
Matthew Honnibal
59c3bf60a6 * Ensure entity recognizer doesn't over-write preset types 2015-08-06 16:09:08 +02:00
Matthew Honnibal
9c1724ecae * Gazetteer stuff working, now need to wire up to API 2015-08-06 00:35:40 +02:00
Matthew Honnibal
d5255aad77 * Update freqs for missing tags in ner, for serializer 2015-07-23 01:17:11 +02:00
Matthew Honnibal
317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal
75aeccc064 * Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search 2015-06-28 11:02:34 +02:00
Matthew Honnibal
579735a095 * Remove import of _state module 2015-06-23 17:25:08 +02:00
Matthew Honnibal
15e177d7a1 * Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc 2015-06-12 01:50:23 +02:00
Matthew Honnibal
e2f9a80713 * Remove old _state imports 2015-06-10 07:09:17 +02:00
Matthew Honnibal
18cc326dc0 * Bug fixes to ner.pyx 2015-06-10 06:57:41 +02:00
Matthew Honnibal
d68c686ec1 * Move StateClass into interface of transition functions 2015-06-10 01:35:28 +02:00