100 Commits

Author SHA1 Message Date
Matthew Honnibal
e31ef9c7f6 Add some property vars for testing 2018-04-03 15:44:31 +02:00
Matthew Honnibal
e9d1e6d66b Fix head alignment for split tokens 2018-04-03 02:32:09 +02:00
Matthew Honnibal
9c5c940441 Fix head alignment in GoldParse 2018-04-03 01:54:45 +02:00
Matthew Honnibal
06a5be9dfd Fix handling of heads for undersegmented tokens 2018-04-03 00:55:05 +02:00
Matthew Honnibal
c8ba54e052 Fix Alignment class for undersegmentation 2018-04-02 23:39:26 +02:00
Matthew Honnibal
e6641a11b1 Refactor alignment into its own class 2018-04-02 21:54:29 +02:00
Matthew Honnibal
fb9c3984b5 Add GoldParse.resize_arrays method 2018-04-01 22:10:53 +02:00
Matthew Honnibal
cb6988f2f4 Fix comment in GoldParse 2018-04-01 22:10:26 +02:00
Matthew Honnibal
3d182fbc43 Represent fused tokens in GoldParse
Entries in GoldParse.{words, heads, tags, deps, ner} can now be lists
instead of single values, to handle getting the analysis for fused
tokens. For instance, let's say we have a token like "hows", while the
gold-standard has two tokens, ["how", "s"]. We need to store the gold
data for each of the two subtokens.

Example gold.words: [["how", "s"], "it", "going"]

Things get more complicated for heads, as we need to address particular
subtokens. Let's say the gold heads for ["how", "s", "it", "going"] are
[1, 1, 3, 1], i.e. the root "s" is a subtoken of the fused token. The
gold.heads list would be:

    [[(0, 1), (0, 1)], 2, (0, 1)]

The tuples indicate token 0, subtoken 1. A helper method
_flatten_fused_heads is available that unpacks the above to
[1, 1, 3, 1].
2018-04-01 17:18:18 +02:00
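The unpacking described in the commit above can be sketched as follows. This is a minimal illustration, not the actual _flatten_fused_heads helper: the function name flatten_fused_heads and its single-argument signature are hypothetical, and it assumes that a plain integer head pointing at a fused token refers to that token's first subtoken.

```python
def flatten_fused_heads(heads):
    """Unpack per-token gold heads into one flat list of subtoken indices.

    Hypothetical sketch. Each entry in `heads` is either a single head or,
    for a fused token, a list of heads (one per subtoken). A head is either
    an int (token index) or a (token, subtoken) tuple.
    """
    # Compute the flat index at which each token's first subtoken lands.
    offsets = []
    flat_length = 0
    for entry in heads:
        offsets.append(flat_length)
        flat_length += len(entry) if isinstance(entry, list) else 1

    def resolve(head):
        if isinstance(head, tuple):
            token, subtoken = head
            return offsets[token] + subtoken
        # Assumption: a bare int addresses the first subtoken of that token.
        return offsets[head]

    flat = []
    for entry in heads:
        if isinstance(entry, list):
            flat.extend(resolve(head) for head in entry)
        else:
            flat.append(resolve(entry))
    return flat


fused_heads = [[(0, 1), (0, 1)], 2, (0, 1)]
flat_heads = flatten_fused_heads(fused_heads)
print(flat_heads)  # [1, 1, 3, 1]
```

The offsets are computed in a first pass because a head may point forward to a token that comes after a fused token, so the flat position of every token has to be known before any head is resolved.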
Matthew Honnibal
728d9841c7 Allocate fused tokens array in GoldParseC 2018-04-01 13:43:56 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
ines
c699aec089 Add offsets_from_biluo_tags helper and tests (see #1626) 2017-11-26 16:38:01 +01:00
Matthew Honnibal
86ddf692a1 Fix bug in limit calculation on dev data 2017-11-14 01:37:10 +01:00
Matthew Honnibal
1cab703bba Move minibatch function to util 2017-11-06 23:45:36 +01:00
ines
d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines
a6135336f5 Tidy up gold 2017-10-27 17:02:55 +02:00
Matthew Honnibal
6e552c9d83 Prune number of non-projective labels more aggressively 2017-10-11 02:46:44 -05:00
Matthew Honnibal
563f46f026 Fix multi-label support for text classification
The TextCategorizer class is supposed to support multi-label
text classification, and allow training data to contain missing
values.

For this to work, the gradient of the loss should be 0 when labels
are missing. Instead, there was no way to actually denote "missing"
in the GoldParse class, and so the TextCategorizer class treated
the label set within gold.cats as complete.

To fix this, we change GoldParse.cats to be a dict instead of a list.
The GoldParse.cats dict should map labels to floats, with 1. denoting
'present' and 0. denoting 'absent'. Gradients are zeroed for categories
absent from the gold.cats dict. A nice bonus is that you can also set
values between 0 and 1 for partial membership. You can also set numeric
values, if you're using a text classification model that uses an
appropriate loss function.

Unfortunately this is a breaking change, although the functionality
was only recently introduced and hasn't been properly documented
yet. I've updated the example script accordingly.
2017-10-05 18:43:02 -05:00
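The effect of the dict-based gold.cats described in the commit above can be sketched as follows. This is only an illustration, not the TextCategorizer code: the function masked_gradient, the label names, and the choice of a squared-error gradient (score minus target) are all assumptions made for the example.

```python
def masked_gradient(scores, cats, labels):
    """Return d(loss)/d(score) per label, zeroed where the gold is missing.

    Hypothetical sketch. `scores` are model outputs aligned with `labels`;
    `cats` maps a subset of labels to float targets (1.0 present,
    0.0 absent, or a value in between for partial membership).
    """
    gradient = []
    for label, score in zip(labels, scores):
        if label in cats:
            # Assumed squared-error loss: the gradient is (score - target).
            gradient.append(score - cats[label])
        else:
            # Label not in the gold dict: treat as missing, no update.
            gradient.append(0.0)
    return gradient


labels = ["SPORTS", "POLITICS", "TECH"]
scores = [0.8, 0.3, 0.5]
# "TECH" is left out of the gold dict, so it is treated as missing.
gold_cats = {"SPORTS": 1.0, "POLITICS": 0.0}
grad = masked_gradient(scores, gold_cats, labels)
print(grad)
```

With a list of labels there is no way to tell "absent" from "unknown"; with a dict, an explicit 0.0 pushes the score down while an omitted key leaves the score untouched.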
Matthew Honnibal
ba23d63c35 Fix minibatch function, for fixed batch size 2017-09-14 13:37:41 +02:00
Matthew Honnibal
4bb6bc3f9e Add support for sent_start to GoldParse 2017-08-25 20:03:14 -05:00
Matthew Honnibal
84b7ed49e4 Ensure updates aren't made if no gold available 2017-08-20 14:41:38 +02:00
Matthew Honnibal
ec63f4fe7b Add option to control how missing entities are handled when getting NER tags 2017-07-29 21:58:37 +02:00
Matthew Honnibal
9bae0ddc50 Fix minibatching 2017-07-22 20:14:49 +02:00
Matthew Honnibal
ed6c85fa3c Fix loading of text categories in GoldParse 2017-07-22 20:04:03 +02:00
Matthew Honnibal
7ea50182a5 Add support for text-classification labels to GoldParse 2017-07-20 00:17:47 +02:00
Matthew Honnibal
ebb6c49cd5 Make alignment case-insensitive for gold 2017-06-04 20:26:42 -05:00
Matthew Honnibal
fc4dd62e84 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:19:05 -05:00
Matthew Honnibal
a053b1218e Fix item counting during training 2017-06-04 20:18:20 -05:00
Matthew Honnibal
9bc4a26213 Add option of data augmentation noise 2017-06-04 20:16:57 -05:00
Matthew Honnibal
f6955a459c Fix prev commit 2017-06-03 14:38:37 -05:00
Matthew Honnibal
468ca6c760 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-03 14:33:51 -05:00
Matthew Honnibal
c647a0d33e Fix training counter for gold preprocessing 2017-06-03 14:33:39 -05:00
Matthew Honnibal
e62f46d39f Clarify gold.pyx slightly 2017-06-03 13:28:52 -05:00
Matthew Honnibal
be4a640f0c Fix arc eager label costs for uint64 2017-05-30 20:37:58 +02:00
Matthew Honnibal
84e66ca6d4 WIP on stringstore change. 27 failures 2017-05-28 14:06:40 +02:00
Matthew Honnibal
d06f235fc9 Fix conflict on convert.py 2017-05-26 11:33:29 -05:00
Matthew Honnibal
2e587c6417 Export iob_to_biluo utility 2017-05-26 11:32:55 -05:00
Matthew Honnibal
daac3e3573 Always shuffle gold data, and support length cap 2017-05-26 11:30:52 -05:00
Matthew Honnibal
3a6e59cc53 Add minibatch function in spacy.gold 2017-05-25 17:15:09 -05:00
Matthew Honnibal
3959d778ac Revert "Revert "WIP on improving parser efficiency""
This reverts commit 532afef4a8.
2017-05-23 03:06:53 -05:00
Matthew Honnibal
532afef4a8 Revert "WIP on improving parser efficiency"
This reverts commit bdaac7ab44.
2017-05-23 03:05:25 -05:00
Matthew Honnibal
bdaac7ab44 WIP on improving parser efficiency 2017-05-23 02:59:31 -05:00
Matthew Honnibal
c9760b2104 Support sentence limits in GoldCorpus 2017-05-22 10:40:46 -05:00
ines
54f04a9fe0 Update API docs with changes in spacy.gold and spacy.language 2017-05-22 12:29:30 +02:00
Matthew Honnibal
2a5eb9f61e Make nonproj methods top-level functions, instead of class methods 2017-05-22 04:51:08 -05:00
Matthew Honnibal
025d9bbc37 Fix handling of non-projective deps 2017-05-22 04:51:08 -05:00
Matthew Honnibal
f13d6c7359 Support gold preprocessing and single gold files 2017-05-22 04:51:08 -05:00
Matthew Honnibal
5db89053aa Merge docstrings 2017-05-21 13:46:23 -05:00
Matthew Honnibal
432b3499b3 Fix memory leak 2017-05-21 13:38:46 -05:00
Matthew Honnibal
4803b3b69e Add GoldCorpus class, to manage data streaming 2017-05-21 09:06:17 -05:00