Commit Graph

58 Commits

Author SHA1 Message Date
Edward
014da12f1d
Don't add tok2vec when efficiency textcat (#9502) 2021-10-20 17:30:19 +02:00
Sofie Van Landeghem
3fd3531e12
Docs for new spacy-trf architectures (#8954)
* use TransformerModel.v2 in quickstart

* update docs for new transformer architectures

* bump spacy_transformers to 1.1.0

* Add new arguments spacy-transformers.TransformerModel.v3

* Mention that mixed-precision support is experimental

* Describe delta transformers.Tok2VecTransformer versions

* add dot

* add dot, again

* Update some more TransformerModel references v2 -> v3

* Add mixed-precision options to the training quickstart

Disable mixed-precision training/prediction by default.

* Update setup.cfg

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/usage/embeddings-transformers.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-10-18 14:15:06 +02:00
Adriane Boyd
8448c7dbc5
Update da trf recommendation (#8921)
Update the da trf recommendation to the same model used in the
pretrained pipelines.
2021-08-12 13:54:02 +02:00
Adriane Boyd
5aa099505f Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-06-23 11:07:14 +02:00
Sofie Van Landeghem
e796aab4b3
Resizable textcat (#7862)
* implement textcat resizing for TextCatCNN

* resizing textcat in-place

* simplify code

* ensure predictions for old textcat labels remain the same after resizing (WIP)

* fix for softmax

* store softmax as attr

* fix ensemble weight copy and cleanup

* restructure slightly

* adjust documentation, update tests and quickstart templates to use latest versions

* extend unit test slightly

* revert unnecessary edits

* fix typo

* ensemble architecture won't be resizable for now

* use resizable layer (WIP)

* revert using resizable layer

* resizable container while avoiding shape inference trouble

* cleanup

* ensure model continues training after resizing

* use fill_b parameter

* use fill_defaults

* resize_layer callback

* format

* bump thinc to 8.0.4

* bump spacy-legacy to 3.0.6
2021-06-16 11:45:00 +02:00
Adriane Boyd
cd6bd91c3a
Switch default train corpus max_length to 0 in quickstart (#8142)
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.

If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
2021-05-20 14:48:09 +02:00
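
The `max_length` behavior described in the commit above (cd6bd91c3a) can be pictured with a small sketch. It is a simplified reading of the commit message, not spaCy's own code: the `Doc` type, the `limit_length` function, and the choice of a non-strict comparison are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-in for a parsed document: a list of sentences,
# each sentence a list of token strings. A document without sentence
# boundaries is modelled as a single "sentence".
Doc = List[List[str]]


def limit_length(docs: Iterable[Doc], max_length: int) -> Iterator[Doc]:
    """Yield documents under a token limit; a limit of 0 disables the check."""
    for sents in docs:
        n_tokens = sum(len(sent) for sent in sents)
        if max_length == 0 or n_tokens <= max_length:
            yield sents  # short enough, or no limit: keep the document whole
        elif len(sents) > 1:
            # Too long, but sentence boundaries exist: fall back to sentences
            # and keep only those that fit.
            for sent in sents:
                if len(sent) <= max_length:
                    yield [sent]
        # Too long and no sentence boundaries: the document is dropped.


corpus = [
    [["A", "short", "doc"]],
    [["First", "sentence"], ["A", "second", "and", "longer", "one"]],
    [["One", "long", "unsplit", "document", "here"]],
]
print(len(list(limit_length(corpus, 0))))  # no limit applied
print(len(list(limit_length(corpus, 3))))  # limit of 3 tokens
```

With a non-zero limit, the same corpus can shrink, be split into sentences, or disappear entirely, which is the surprise the commit message gives as the reason for defaulting to `0`.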
Adriane Boyd
d2bdaa7823
Replace negative rows with 0 in StaticVectors (#7674)
* Replace negative rows with 0 in StaticVectors

Replace negative row indices with 0-vectors in `StaticVectors`.

* Increase versions related to StaticVectors

* Increase versions of all architectures and layers related to
`StaticVectors`
* Improve efficiency of 0-vector operations

Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5

* Update config defaults to new versions

* Update docs
2021-04-22 18:04:15 +10:00
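
The idea in the commit above (d2bdaa7823), replacing negative row indices with 0-vectors, can be sketched in plain Python. This is an illustration only, not the `StaticVectors` implementation: `lookup_rows` and the list-of-lists table are hypothetical names and structures. The sketch also shows why the substitution matters in Python-style indexing, where `-1` would otherwise select the last row.

```python
from typing import List, Sequence


def lookup_rows(table: Sequence[Sequence[float]], rows: Sequence[int]) -> List[List[float]]:
    """Return one vector per row index; negative indices give a zero vector."""
    width = len(table[0])
    output = []
    for row in rows:
        if row < 0:
            # No valid row for this key: emit zeros instead of letting
            # the negative index wrap around to the end of the table.
            output.append([0.0] * width)
        else:
            output.append(list(table[row]))
    return output


table = [[1.0, 2.0], [3.0, 4.0]]
print(lookup_rows(table, [0, -1, 1]))
print(table[-1])  # plain indexing with -1 returns the last row instead
```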
Adriane Boyd
8a4200d4e9 Omit unused tok2vec/transformer components
Omit unused tok2vec/transformer components in quickstart template.
2021-03-02 15:53:30 +01:00
Adriane Boyd
ee7bb0b393 Fix formatting in bg/bn quickstart recs 2021-02-26 17:08:37 +01:00
Ines Montani
1e3a326e53 Change Dutch transformer recommendation [ci skip]
https://github.com/explosion/spaCy/discussions/6529#discussioncomment-366620
2021-02-14 15:30:16 +11:00
Adriane Boyd
0ee2ae86bf Update trf quickstart recommendations
Add/update trf recommendations for Bengali, Hindi, Sinhala, and Tamil
based on #7044.
2021-02-12 15:55:17 +01:00
Adriane Boyd
35a863cd27 Remove nlp.tokenizer from quickstart template
Remove `nlp.tokenizer` from quickstart template so that the default
language-specific tokenizer settings are filled instead.
2021-02-01 11:20:12 +01:00
Ines Montani
78d6ff4dd4 Update quickstart recommendations 2021-01-28 11:14:49 +11:00
Ines Montani
ec5f55aa5b
Update config generation defaults and transformers (#6832) 2021-01-27 23:56:33 +11:00
Sofie Van Landeghem
75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem
afc5714d32
multi-label textcat component (#6474)
* multi-label textcat component

* formatting

* fix comment

* cleanup

* fix from #6481

* random edit to push the tests

* add explicit error when textcat is called with multi-label gold data

* fix error nr

* small fix
2021-01-06 13:07:14 +11:00
Sofie Van Landeghem
282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Adriane Boyd
fa8fa474a3 Add nlp.batch_size setting
Add a default `batch_size` setting for `Language.pipe` and
`Language.evaluate` as `nlp.batch_size`.
2020-12-09 09:13:26 +01:00
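
A default batch size of the kind introduced in the commit above (fa8fa474a3) usually works as a fallback: an explicit argument wins, otherwise the configured default applies. The sketch below illustrates that pattern under stated assumptions; `batched`, `DEFAULT_BATCH_SIZE`, and the value `1000` are hypothetical and are not taken from `Language.pipe` or `Language.evaluate`.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for a configured default such as `nlp.batch_size`.
DEFAULT_BATCH_SIZE = 1000


def batched(items: Iterable[T], batch_size: Optional[int] = None) -> Iterator[List[T]]:
    """Group items into lists, using the default size when none is given."""
    size = DEFAULT_BATCH_SIZE if batch_size is None else batch_size
    if size < 1:
        raise ValueError("batch size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


print(list(batched(range(5), batch_size=2)))
print(len(list(batched(range(2500)))))  # falls back to DEFAULT_BATCH_SIZE
```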
Sofie Van Landeghem
a0c899a0ff
Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Sofie Van Landeghem
75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Adriane Boyd
c8d04b79e2 Sort and add vectors for langs without transformers 2020-10-16 08:25:16 +02:00
Adriane Boyd
2fbd43c603 Use core lg models as vectors models in quickstart 2020-10-16 08:17:53 +02:00
Ines Montani
1f49300862 Update transformer recommendations [ci skip] 2020-10-13 15:41:17 +02:00
Matthew Honnibal
b7e01d2024 Fix quickstart 2020-10-05 21:21:30 +02:00
Matthew Honnibal
ff8b980775 Upd quickstart template 2020-10-05 21:19:41 +02:00
Adriane Boyd
22158dc24a Add morphologizer to quickstart template 2020-10-02 15:06:16 +02:00
Ines Montani
fe3f111c37
Merge pull request #6168 from explosion/fix/default-corpus-values 2020-09-30 00:24:02 +02:00
Ines Montani
ae51843468 Remove augmenter from jinja template [ci skip] 2020-09-29 23:08:50 +02:00
Ines Montani
1aeef3bfbb Make corpus paths default to None and improve errors 2020-09-29 22:33:46 +02:00
Ines Montani
d3c63b7965 Merge branch 'develop' into feature/prepare 2020-09-29 20:53:05 +02:00
Ines Montani
534e1ef498 Fix template 2020-09-29 17:02:55 +02:00
Ines Montani
1590de11b1 Update config 2020-09-28 12:05:23 +02:00
Matthew Honnibal
a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani
ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
svlandeg
35dbc63578 Merge remote-tracking branch 'upstream/develop' into fix/nr_features
# Conflicts:
#	spacy/ml/models/parser.py
#	spacy/tests/serialize/test_serialize_config.py
#	website/docs/api/architectures.md
2020-09-23 17:01:13 +02:00
svlandeg
dd2292793f 'parser' instead of 'deps' for state_type 2020-09-23 16:53:49 +02:00
svlandeg
6c85fab316 state_type and extra_state_tokens instead of nr_feature_tokens 2020-09-23 13:35:09 +02:00
Ines Montani
7745d77a38 Fix whitespace in template [ci skip] 2020-09-23 13:21:42 +02:00
Ines Montani
6ca06cb62c Update docs and formatting [ci skip] 2020-09-23 10:14:27 +02:00
svlandeg
556f3e4652 add pooling to NEL's TransformerListener 2020-09-23 09:24:28 +02:00
svlandeg
085a1c8e2b add no_output_layer to TextCatBOW config 2020-09-22 12:06:40 +02:00
svlandeg
e931f4d757 add textcat score 2020-09-22 10:56:43 +02:00
svlandeg
396b33257f add entity_linker to jinja template 2020-09-22 10:40:05 +02:00
svlandeg
135de82a2d add textcat to quickstart 2020-09-22 10:22:06 +02:00
Ines Montani
554c9a2497 Update docs [ci skip] 2020-09-20 12:30:53 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
svlandeg
0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg
51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00
Matthew Honnibal
4b7abaafdb Fix learn rate for non-transformer 2020-09-04 21:22:50 +02:00
Ines Montani
23b7d9cfa3 Prefix span getters 2020-09-03 17:37:06 +02:00