Commit Graph

57 Commits

Author SHA1 Message Date
svlandeg
20389e4553 format and bugfix 2019-07-22 15:08:17 +02:00
svlandeg
41fb5204ba output tensors as part of predict 2019-07-19 14:47:36 +02:00
svlandeg
21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
svlandeg
e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg
ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
svlandeg
a63d15a142 code cleanup 2019-07-15 17:36:43 +02:00
svlandeg
60f299374f set default context width 2019-07-15 12:03:09 +02:00
Sofie Van Landeghem
c4c21cb428 more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Ines Montani
f2ea3e3ea2
Merge branch 'master' into feature/nel-wiki 2019-07-09 21:57:47 +02:00
Ines Montani
547464609d Remove merge_subtokens from parser postprocessing for now 2019-07-09 21:50:30 +02:00
svlandeg
668b17ea4a deuglify kb deserializer 2019-07-03 15:00:42 +02:00
svlandeg
8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg
2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg
c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg
68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
Ines Montani
37f744ca00 Auto-format [ci skip] 2019-06-26 14:48:09 +02:00
svlandeg
1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg
58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg
b58bace84b small fixes 2019-06-24 10:55:04 +02:00
svlandeg
cc9ae28a52 custom error and warning messages 2019-06-19 12:35:26 +02:00
svlandeg
791327e3c5 Merge remote-tracking branch 'upstream/master' into feature/nel-wiki 2019-06-19 09:44:05 +02:00
svlandeg
a31648d28b further code cleanup 2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f small tweaks and documentation 2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146 clean up code, remove old code, move to bin 2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555 sentence encoder only (removing article/mention encoder) 2019-06-18 00:05:47 +02:00
svlandeg
b312f2d0e7 redo training data to be independent of KB and entity-level instead of doc-level 2019-06-14 15:55:26 +02:00
svlandeg
78dd3e11da write entity linking pipe to file and keep vocab consistent between kb and nlp 2019-06-13 16:25:39 +02:00
svlandeg
b12001f368 small fixes 2019-06-12 22:05:53 +02:00
svlandeg
6521cfa132 speeding up training 2019-06-12 13:37:05 +02:00
svlandeg
fe1ed432ef eval on dev set, varying combo's of prior and context scores 2019-06-11 11:40:58 +02:00
svlandeg
83dc7b46fd first tests with EL pipe 2019-06-10 21:25:26 +02:00
Matthew Honnibal
a931d72459 Add merge_subtokens as parser post-process. Re #3830 2019-06-07 20:40:41 +02:00
svlandeg
7de1ee69b8 training loop in proper pipe format 2019-06-07 15:55:10 +02:00
svlandeg
0486ccabfd introduce goldparse.links 2019-06-07 13:54:45 +02:00
svlandeg
a5c061f506 storing NEL training data in GoldParse objects 2019-06-07 12:58:42 +02:00
svlandeg
61f0e2af65 code cleanup 2019-06-06 20:22:14 +02:00
svlandeg
5c723c32c3 entity vectors in the KB + serialization of them 2019-06-05 18:29:18 +02:00
svlandeg
9abbd0899f separate entity encoder to get 64D descriptions 2019-06-05 00:09:46 +02:00
svlandeg
fb37cdb2d3 implementing el pipe in pipes.pyx (not tested yet) 2019-06-03 21:32:54 +02:00
svlandeg
dd691d0053 debugging 2019-05-17 17:44:11 +02:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework 2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d entity as one field instead of both ID and name 2019-03-25 18:10:41 +01:00
Matthew Honnibal
6c783f8045 Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction

The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
Ines Montani
06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
svlandeg
5318ce88fa 'entity_linker' instead of 'el' 2019-03-22 13:55:10 +01:00
svlandeg
1ee0e78fd7 select candidate with highest prior probabiity 2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2 minimal EL pipe 2019-03-22 11:36:45 +01:00
svlandeg
735fc2a735 annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455 adding kb_id as field to token, el as nlp pipeline component 2019-03-22 11:34:46 +01:00
Ines Montani
cb5dbfa63a Tidy up references to n_threads and fix default 2019-03-15 16:24:26 +01:00