svlandeg
21176517a7
have gold.links correspond exactly to doc.ents
2019-07-19 12:36:15 +02:00
svlandeg
e1213eaf6a
use original gold object in get_loss function
2019-07-18 13:35:10 +02:00
svlandeg
ec55d2fccd
filter training data beforehand (+black formatting)
2019-07-18 10:22:24 +02:00
svlandeg
a63d15a142
code cleanup
2019-07-15 17:36:43 +02:00
svlandeg
60f299374f
set default context width
2019-07-15 12:03:09 +02:00
Sofie Van Landeghem
c4c21cb428
more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels
* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Ines Montani
f2ea3e3ea2
Merge branch 'master' into feature/nel-wiki
2019-07-09 21:57:47 +02:00
Ines Montani
547464609d
Remove merge_subtokens from parser postprocessing for now
2019-07-09 21:50:30 +02:00
svlandeg
668b17ea4a
deuglify kb deserializer
2019-07-03 15:00:42 +02:00
svlandeg
8840d4b1b3
fix for context encoder optimizer
2019-07-03 13:35:36 +02:00
svlandeg
2d2dea9924
experiment with adding NER types to the feature vector
2019-06-29 14:52:36 +02:00
svlandeg
c664f58246
adding prior probability as feature in the model
2019-06-28 16:22:58 +02:00
svlandeg
68a0662019
context encoder with Tok2Vec + linking model instead of cosine
2019-06-28 08:29:31 +02:00
Ines Montani
37f744ca00
Auto-format [ci skip]
2019-06-26 14:48:09 +02:00
svlandeg
1de61f68d6
improve speed of prediction loop
2019-06-26 13:53:10 +02:00
svlandeg
58a5b40ef6
clean up duplicate code
2019-06-24 15:19:58 +02:00
svlandeg
b58bace84b
small fixes
2019-06-24 10:55:04 +02:00
svlandeg
cc9ae28a52
custom error and warning messages
2019-06-19 12:35:26 +02:00
svlandeg
791327e3c5
Merge remote-tracking branch 'upstream/master' into feature/nel-wiki
2019-06-19 09:44:05 +02:00
svlandeg
a31648d28b
further code cleanup
2019-06-19 09:15:43 +02:00
svlandeg
478305cd3f
small tweaks and documentation
2019-06-18 18:38:09 +02:00
svlandeg
0d177c1146
clean up code, remove old code, move to bin
2019-06-18 13:20:40 +02:00
svlandeg
ffae7d3555
sentence encoder only (removing article/mention encoder)
2019-06-18 00:05:47 +02:00
svlandeg
b312f2d0e7
redo training data to be independent of KB and entity-level instead of doc-level
2019-06-14 15:55:26 +02:00
svlandeg
78dd3e11da
write entity linking pipe to file and keep vocab consistent between kb and nlp
2019-06-13 16:25:39 +02:00
svlandeg
b12001f368
small fixes
2019-06-12 22:05:53 +02:00
svlandeg
6521cfa132
speeding up training
2019-06-12 13:37:05 +02:00
svlandeg
fe1ed432ef
eval on dev set, varying combos of prior and context scores
2019-06-11 11:40:58 +02:00
svlandeg
83dc7b46fd
first tests with EL pipe
2019-06-10 21:25:26 +02:00
Matthew Honnibal
a931d72459
Add merge_subtokens as parser post-process. Re #3830
2019-06-07 20:40:41 +02:00
svlandeg
7de1ee69b8
training loop in proper pipe format
2019-06-07 15:55:10 +02:00
svlandeg
0486ccabfd
introduce goldparse.links
2019-06-07 13:54:45 +02:00
svlandeg
a5c061f506
storing NEL training data in GoldParse objects
2019-06-07 12:58:42 +02:00
svlandeg
61f0e2af65
code cleanup
2019-06-06 20:22:14 +02:00
svlandeg
5c723c32c3
entity vectors in the KB + serialization of them
2019-06-05 18:29:18 +02:00
svlandeg
9abbd0899f
separate entity encoder to get 64D descriptions
2019-06-05 00:09:46 +02:00
svlandeg
fb37cdb2d3
implementing el pipe in pipes.pyx (not tested yet)
2019-06-03 21:32:54 +02:00
svlandeg
dd691d0053
debugging
2019-05-17 17:44:11 +02:00
Sofie
a4a6bfa4e1
Merge branch 'master' into feature/el-framework
2019-03-26 11:00:02 +01:00
svlandeg
8814b9010d
entity as one field instead of both ID and name
2019-03-25 18:10:41 +01:00
Matthew Honnibal
6c783f8045
Bug fixes and options for TextCategorizer (#3472)
* Fix code for bag-of-words feature extraction
The _ml.py module had a redundant copy of a function to extract unigram
bag-of-words features, except one had a bug that set values to 0.
Another function allowed extraction of bigram features. Replace all three
with a new function that supports arbitrary ngram sizes and also allows
control of which attribute is used (e.g. ORTH, LOWER, etc).
* Support 'bow' architecture for TextCategorizer
This allows efficient ngram bag-of-words models, which are better when
the classifier needs to run quickly, especially when the texts are long.
Pass architecture="bow" to use it. The extra arguments ngram_size and
attr are also available, e.g. ngram_size=2 means unigram and bigram
features will be extracted.
* Fix size limits in train_textcat example
* Explain architectures better in docs
2019-03-23 16:44:44 +01:00
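The n-gram behaviour described in the commit above (ngram_size=2 yields both unigram and bigram features) can be illustrated with a minimal pure-Python sketch. This is not the code from _ml.py: the function name `extract_ngrams` and the `lowercase` flag are hypothetical, with the flag standing in for the choice between attributes such as ORTH and LOWER.

```python
from collections import Counter
from typing import List


def extract_ngrams(tokens: List[str], ngram_size: int = 1, lowercase: bool = False) -> Counter:
    """Count every n-gram of length 1 up to and including ngram_size.

    Hypothetical helper for illustration only; not spaCy's implementation.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be >= 1")
    if lowercase:
        # Rough analogue of using the LOWER attribute instead of ORTH.
        tokens = [t.lower() for t in tokens]
    counts: Counter = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token sequence.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


features = extract_ngrams(["The", "cat", "sat"], ngram_size=2, lowercase=True)
```

With three tokens and ngram_size=2, the result holds three unigrams and two bigrams, and no trigram.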
Ines Montani
06bf130890
💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component
* Replace default factory
* Add tests
* Tidy up
* Pass test
* Update docs
2019-03-23 15:45:02 +01:00
svlandeg
5318ce88fa
'entity_linker' instead of 'el'
2019-03-22 13:55:10 +01:00
svlandeg
1ee0e78fd7
select candidate with highest prior probability
2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2
minimal EL pipe
2019-03-22 11:36:45 +01:00
svlandeg
735fc2a735
annotate kb_id through ents in doc
2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455
adding kb_id as field to token, el as nlp pipeline component
2019-03-22 11:34:46 +01:00
Ines Montani
278e9d2eb0
Merge branch 'master' into feature/lemmatizer
2019-03-16 13:44:22 +01:00
Ines Montani
cb5dbfa63a
Tidy up references to n_threads and fix default
2019-03-15 16:24:26 +01:00
Ines Montani
7ba3a5d95c
💫 Make serialization methods consistent (#3385)
* Make serialization methods consistent
exclude keyword argument instead of random named keyword arguments and deprecation handling
* Update docs and add section on serialization fields
2019-03-10 19:16:45 +01:00
Matthew Honnibal
0f12082465
Refactor morphologizer
2019-03-09 22:54:59 +00:00
Matthew Honnibal
cc2b2dba14
Neaten set_morphology option on Tagger
2019-03-08 19:16:02 +01:00
Matthew Honnibal
afa227e25b
Fix setter
2019-03-08 19:10:01 +01:00
Matthew Honnibal
b27bd42613
Fix compile error
2019-03-08 19:06:02 +01:00
Matthew Honnibal
c91577db02
Add set_morphology cfg option for Tagger
2019-03-08 19:03:17 +01:00
Ines Montani
296446a1c8
Tidy up and improve docs and docstrings (#3370)
## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs
### Types of change
enhancement, docs
## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Matthew Honnibal
6b0008afc6
Clean up TextCategorizer slightly
2019-02-23 12:28:06 +01:00
Matthew Honnibal
ce1e4eace2
Default to former TextCategorizer model
* Keep TextCategorizer default model same as v2.0
* Add option 'architecture' that allows "simple_cnn" to switch to
simpler model.
* Add option exclusive_classes, defaulting to False. If set to True,
the model treats classes as mutually exclusive, i.e. only one class can
be true per instance.
2019-02-23 11:55:16 +01:00
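The difference between mutually exclusive and independent classes mentioned in the commit above can be sketched as follows. The commit only states that exclusive_classes=True treats classes as mutually exclusive; using a softmax for the exclusive case and per-class sigmoids otherwise is a common way to realise that distinction and is an assumption here, not a description of the TextCategorizer code. The function name `score_classes` is hypothetical.

```python
import math
from typing import List


def score_classes(logits: List[float], exclusive_classes: bool = False) -> List[float]:
    """Turn raw per-class scores into probabilities.

    Hypothetical illustration: softmax when classes are mutually exclusive,
    independent sigmoids otherwise. Not taken from spaCy's source.
    """
    if exclusive_classes:
        # Softmax: scores compete, so the outputs sum to 1.
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    # Sigmoid per class: each label is scored on its own,
    # so several labels can be likely at the same time.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


independent = score_classes([2.0, 2.0], exclusive_classes=False)
exclusive = score_classes([2.0, 2.0], exclusive_classes=True)
```

With two equally strong scores, the independent setting rates both classes as likely, while the exclusive setting splits the probability mass evenly between them.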
Matthew Honnibal
a137e8b418
Fix Pipe.to_bytes() when model uninitialized
Closes #3289
2019-02-21 09:42:02 +01:00
Ines Montani
f146121092
💫 Make handling of [Pipe].labels consistent (#3273)
* Make handling of [Pipe].labels consistent
* Un-xfail passing test
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: ines <ines@ines.io>
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: ines <ines@ines.io>
* Update spacy/tests/pipeline/test_pipe_methods.py
Co-Authored-By: ines <ines@ines.io>
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: ines <ines@ines.io>
* Move error message to spacy.errors
* Fix textcat labels and test
* Make EntityRuler.labels return tuple as well
2019-02-15 06:03:19 +11:00
Ines Montani
a9f8d17632
💫 Break up large pipeline.pyx (#3246)
* Break up large pipeline.pyx
* Merge some components back together
* Fix typo
2019-02-10 12:14:51 +01:00