svlandeg
20389e4553
format and bugfix
2019-07-22 15:08:17 +02:00
svlandeg
41fb5204ba
output tensors as part of predict
2019-07-19 14:47:36 +02:00
svlandeg
21176517a7
have gold.links correspond exactly to doc.ents
2019-07-19 12:36:15 +02:00
svlandeg
e1213eaf6a
use original gold object in get_loss function
2019-07-18 13:35:10 +02:00
svlandeg
ec55d2fccd
filter training data beforehand (+black formatting)
2019-07-18 10:22:24 +02:00
svlandeg
a63d15a142
code cleanup
2019-07-15 17:36:43 +02:00
svlandeg
60f299374f
set default context width
2019-07-15 12:03:09 +02:00
Sofie Van Landeghem
c4c21cb428
more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels
* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Ines Montani
f2ea3e3ea2
Merge branch 'master' into feature/nel-wiki
2019-07-09 21:57:47 +02:00
Ines Montani
547464609d
Remove merge_subtokens from parser postprocessing for now
2019-07-09 21:50:30 +02:00
svlandeg
668b17ea4a
deuglify kb deserializer
2019-07-03 15:00:42 +02:00
svlandeg
8840d4b1b3
fix for context encoder optimizer
2019-07-03 13:35:36 +02:00
svlandeg
2d2dea9924
experiment with adding NER types to the feature vector
2019-06-29 14:52:36 +02:00
svlandeg
c664f58246
adding prior probability as feature in the model
2019-06-28 16:22:58 +02:00
svlandeg
68a0662019
context encoder with Tok2Vec + linking model instead of cosine
2019-06-28 08:29:31 +02:00
Ines Montani
37f744ca00
Auto-format [ci skip]
2019-06-26 14:48:09 +02:00
svlandeg
1de61f68d6
improve speed of prediction loop
2019-06-26 13:53:10 +02:00
svlandeg
58a5b40ef6
clean up duplicate code
2019-06-24 15:19:58 +02:00
svlandeg
b58bace84b
small fixes
2019-06-24 10:55:04 +02:00
svlandeg
cc9ae28a52
custom error and warning messages
2019-06-19 12:35:26 +02:00
svlandeg
791327e3c5
Merge remote-tracking branch 'upstream/master' into feature/nel-wiki
2019-06-19 09:44:05 +02:00
svlandeg
a31648d28b
further code cleanup
2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f
small tweaks and documentation
2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146
clean up code, remove old code, move to bin
2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555
sentence encoder only (removing article/mention encoder)
2019-06-18 00:05:47 +02:00
svlandeg
b312f2d0e7
redo training data to be independent of KB and entity-level instead of doc-level
2019-06-14 15:55:26 +02:00
svlandeg
78dd3e11da
write entity linking pipe to file and keep vocab consistent between kb and nlp
2019-06-13 16:25:39 +02:00
svlandeg
b12001f368
small fixes
2019-06-12 22:05:53 +02:00
svlandeg
6521cfa132
speeding up training
2019-06-12 13:37:05 +02:00
svlandeg
fe1ed432ef
eval on dev set, varying combos of prior and context scores
2019-06-11 11:40:58 +02:00
svlandeg
83dc7b46fd
first tests with EL pipe
2019-06-10 21:25:26 +02:00
Matthew Honnibal
a931d72459
Add merge_subtokens as parser post-process. Re #3830
2019-06-07 20:40:41 +02:00
svlandeg
7de1ee69b8
training loop in proper pipe format
2019-06-07 15:55:10 +02:00
svlandeg
0486ccabfd
introduce goldparse.links
2019-06-07 13:54:45 +02:00
svlandeg
a5c061f506
storing NEL training data in GoldParse objects
2019-06-07 12:58:42 +02:00
svlandeg
61f0e2af65
code cleanup
2019-06-06 20:22:14 +02:00
svlandeg
5c723c32c3
entity vectors in the KB + serialization of them
2019-06-05 18:29:18 +02:00
svlandeg
9abbd0899f
separate entity encoder to get 64D descriptions
2019-06-05 00:09:46 +02:00
svlandeg
fb37cdb2d3
implementing el pipe in pipes.pyx (not tested yet)
2019-06-03 21:32:54 +02:00
svlandeg
dd691d0053
debugging
2019-05-17 17:44:11 +02:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework
2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d
entity as one field instead of both ID and name
2019-03-25 18:10:41 +01:00
Matthew Honnibal
6c783f8045
Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction
The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).
* Support 'bow' architecture for TextCategorizer
This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.
* Fix size limits in train_textcat example
* Explain architectures better in docs
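Note on ngram_size: the commit message says ngram_size=2 extracts both unigram and bigram features. The following is a minimal plain-Python sketch of that counting scheme, not the code in _ml.py; the helper name `extract_ngrams` and the use of string tokens (rather than an attribute such as ORTH or LOWER) are assumptions made for illustration.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1):
    """Count every n-gram of length 1..ngram_size (hypothetical helper)."""
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# With ngram_size=2, both unigrams and bigrams are counted.
features = extract_ngrams(["the", "cat", "sat"], ngram_size=2)
```

Here `features` holds three unigram keys and two bigram keys, each as a tuple of tokens.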
2019-03-23 16:44:44 +01:00
Ines Montani
06bf130890
💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component
* Replace default factory
* Add tests
* Tidy up
* Pass test
* Update docs
2019-03-23 15:45:02 +01:00
svlandeg
5318ce88fa
'entity_linker' instead of 'el'
2019-03-22 13:55:10 +01:00
svlandeg
1ee0e78fd7
select candidate with highest prior probability
2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2
minimal EL pipe
2019-03-22 11:36:45 +01:00
svlandeg
735fc2a735
annotate kb_id through ents in doc
2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455
adding kb_id as field to token, el as nlp pipeline component
2019-03-22 11:34:46 +01:00
Ines Montani
cb5dbfa63a
Tidy up references to n_threads and fix default
2019-03-15 16:24:26 +01:00