Commit Graph

12287 Commits

Author SHA1 Message Date
Adriane Boyd
ad15499b3b
Fix get_loss for values outside of labels in senter (#5730)
* Fix get_loss for None alignments in senter

When converting the `sent_start` values back to `SentenceRecognizer`
labels, handle `None` alignments.

* Handle SENT_START as -1

Handle SENT_START as -1 (or -1 converted to uint64) by treating any
values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.
2020-07-09 01:41:58 +02:00
Matthw Honnibal
9b49787f35 Update NER config. Getting 84.8 2020-07-08 21:38:01 +02:00
Matthw Honnibal
1b20ffac38 batch_by_words by default 2020-07-08 21:37:06 +02:00
Matthw Honnibal
93e50da46a Remove auto 'set_annotation' in training to address GPU memory 2020-07-08 21:36:51 +02:00
Matthw Honnibal
fb8a5967c1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-08 15:27:50 +02:00
Ines Montani
0a3d41bb1d
Deprecate model shortcuts and simplify download (#5722) 2020-07-08 14:00:07 +02:00
Adriane Boyd
c9f0f75778
Update get_loss for senter and morphologizer (#5724)
* Update get_loss for senter

Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`.

* Update get_loss for morphologizer

Update `Morphologizer.get_loss` to keep it similar to `Tagger`.
2020-07-08 13:59:28 +02:00
Ines Montani
9ae4040183 Update API docs 2020-07-08 13:34:35 +02:00
svlandeg
c94279ac1b remove tensors, fix predict, get_loss and set_annotations 2020-07-08 13:11:54 +02:00
svlandeg
90b100c39f remove component.Model, update constructor, losses is return value of update 2020-07-08 12:14:30 +02:00
Ines Montani
3d83721551
Merge pull request #5723 from gandersen101/fix-spaczz-universe-typo 2020-07-08 11:35:40 +02:00
Matthw Honnibal
ca989f4cc4 Improve cutting logic in parser 2020-07-08 11:27:54 +02:00
Matthw Honnibal
42e1109def Support option to not batch by number of words 2020-07-08 11:26:54 +02:00
gandersen101
893133873d Fix quote issue in spaczz universe.json 2020-07-07 19:16:28 -05:00
Ines Montani
109849bd31 Fix and update universe.json [ci skip] 2020-07-07 21:12:28 +02:00
gandersen101
9097549227
Adding spaczz package to universe.json (#5717)
* Adding spaczz package to universe.json

* Adding contributor agreement.
2020-07-07 20:55:24 +02:00
Jonathan Besomi
546f3d10d4
Add texthero to universe.json (#5716)
* Add texthero to universe.json

* Add spaCy contributor Agreement
2020-07-07 20:54:22 +02:00
Ines Montani
8cb7f9ccff
Improve assets and DVC handling (#5719)
* Improve assets and DVC handling

* Remove outdated comment [ci skip]
2020-07-07 20:51:50 +02:00
Ines Montani
2298e129e6 Update example and training docs 2020-07-07 20:30:12 +02:00
svlandeg
2b60e894cb fix component constructors, update, begin_training, reference to GoldParse 2020-07-07 19:17:19 +02:00
Sofie Van Landeghem
a39a110c4e
Few more Example unit tests (#5720)
* small fixes in Example, UX

* add gold tests for aligned_spans and get_aligned_parse

* sentencizer unnecessary
2020-07-07 18:46:00 +02:00
Matthw Honnibal
433dc3c9c9 Simplify PrecomputableAffine slightly 2020-07-07 17:22:47 +02:00
Matthw Honnibal
a4164f67ca Don't normalize gradients 2020-07-07 17:21:58 +02:00
Matthw Honnibal
8177f25b6c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-07 17:21:10 +02:00
svlandeg
14a796e3f9 add Example API with examples of Example usage 2020-07-07 14:46:41 +02:00
Ines Montani
fa00a85828
Merge pull request #5715 from explosion/chore/tidy-regression-tests 2020-07-07 11:22:07 +02:00
Matthw Honnibal
d1fd3438c3 Add dropout to parser hidden layer 2020-07-07 01:38:15 +02:00
Ines Montani
bb3ee38cf9 Update WIP 2020-07-06 22:22:37 +02:00
Ines Montani
44da24ddd0 Update doc.md 2020-07-06 18:17:00 +02:00
Ines Montani
44790c1c32 Update docs and add keyword-only tag 2020-07-06 18:14:57 +02:00
Matthw Honnibal
1eb1654941 Update configs 2020-07-06 17:51:37 +02:00
Matthw Honnibal
f25761e513 Don't randomize cuts in parser 2020-07-06 17:51:25 +02:00
Matthw Honnibal
709fc5e4ad Clarify dropout and seed in Tok2Vec 2020-07-06 17:50:21 +02:00
Matthew Honnibal
19d42f42de Set version to v3.0.0a2 2020-07-06 17:43:12 +02:00
Matthew Honnibal
cc477be952
Improve gold-standard alignment (#5711)
* Remove previous alignment

* Implement better alignment, using ragged data structure

* Use pytokenizations for alignment

* Fixes

* Fixes

* Fix overlapping entities in alignment

* Fix align split_sents

* Update test

* Commit align.py

* Try to appease setuptools

* Fix flake8

* use realistic entities for testing

* Update tests for better alignment

* Improve alignment heuristic

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-07-06 17:39:31 +02:00
Mike Izbicki
7a2ca00794
fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701)
* speed up Korean nlp 100x by stopping mecab from reloading on each doc

* add contributor agreement

* rename variables to improve code readability
2020-07-06 17:03:33 +02:00
Ines Montani
b6deef80f8 Fix class so pickling works as expected 2020-07-06 16:43:45 +02:00
Ines Montani
a35236e5f0 Update v3 docs WIP [ci skip] 2020-07-06 15:57:44 +02:00
Ines Montani
fa261d09e8 Add alternative CLI option 2020-07-06 15:57:38 +02:00
Adriane Boyd
c67fc6aa5b
Make docs_to_json backwards-compatible with v2 (#5714)
* In `spacy convert -t json` output the JSON docs wrapped in a list

* Add back token-level `ner` alongside the doc-level `entities`
2020-07-06 14:15:00 +02:00
Ines Montani
5b7b2a498d Tidy up and merge regression tests 2020-07-06 14:05:59 +02:00
Ines Montani
412dbb1f38
Remove dead and/or deprecated code (#5710)
* Remove dead and/or deprecated code

* Remove n_threads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-06 13:06:25 +02:00
Sofie Van Landeghem
fcbf899b08
Feature/example only (#5707)
* remove _convert_examples

* fix test_gold, raise TypeError if tuples are used instead of `Example` objects

* throwing proper errors when the wrong type of objects are passed

* fix deprecated format in tests

* fix deprecated format in parser tests

* fix tests for NEL, morph, senter, tagger, textcat

* update regression tests with new Example format

* use make_doc

* more fixes to nlp.update calls

* few more small fixes for rehearse and evaluate

* only import ml_datasets if really necessary
2020-07-06 13:02:36 +02:00
Ines Montani
63247cbe87 Update v3 docs [ci skip] 2020-07-05 16:11:16 +02:00
graue70
9860b8399e
Fix typo in test function docstring (#5696) 2020-07-05 15:49:06 +02:00
Matthew Honnibal
3e78e82a83
Experimental character-based pretraining (#5700)
* Use cosine loss in Cloze multitask

* Fix char_embed for gpu

* Call resume_training for base model in train CLI

* Fix bilstm_depth default in pretrain command

* Implement character-based pretraining objective

* Use chars loss in ClozeMultitask

* Add method to decode predicted characters

* Fix number characters

* Rescale gradients for mlm

* Fix char embed+vectors in ml

* Fix pipes

* Fix pretrain args

* Move get_characters_loss

* Fix import

* Fix import

* Mention characters loss option in pretrain

* Remove broken 'self attention' option in pretrain

* Revert "Remove broken 'self attention' option in pretrain"

This reverts commit 56b820f6af.

* Document 'characters' objective of pretrain
2020-07-05 15:48:39 +02:00
Matthw Honnibal
3f6f087113 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-04 23:52:12 +02:00
Matthw Honnibal
5642507823 Fix has_unknown_spaces in Doc.copy 2020-07-04 23:52:02 +02:00
Matthw Honnibal
8870a6ded7 Specify seeds in HashEmbed 2020-07-04 23:51:49 +02:00
Ines Montani
dc8c9d912f Update docs [ci skip] 2020-07-04 16:47:24 +02:00