Commit Graph

189 Commits

Author SHA1 Message Date
Sofie Van Landeghem
cd70c3cb79
Fixing pretrain (#7342)
* initialize NLP with train corpus

* add more pretraining tests

* more tests

* function to fetch tok2vec layer for pretraining

* clarify parameter name

* test different objectives

* formatting

* fix check for static vectors when using vectors objective

* clarify docs

* logger statement

* fix init_tok2vec and proc.initialize order

* test training after pretraining

* add init_config tests for pretraining

* pop pretraining block to avoid config validation errors

* custom errors
2021-03-09 14:01:13 +11:00
svlandeg
d900c55061 consistently use registry as callable 2021-03-02 17:56:28 +01:00
René Octavio Queiroz Dias
59271e887a
fix: TransformerListener with TextCatEnsemble (#6951)
* bug: Regression test
Issue #6946

* fix: Fix issue #6946

* chore: Remove regression test
2021-02-06 13:44:51 +01:00
Matthew Honnibal
ffc371350a
Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800) 2021-01-24 14:37:33 +11:00
Sofie Van Landeghem
c8761b0e6e
rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760) 2021-01-19 07:37:17 +08:00
Adriane Boyd
26c34ab8b0
Fix parser resizing for cupy (#6758) 2021-01-18 20:43:15 +01:00
Matthew Honnibal
c2a18e4fa3 Update textcat ensemble model 2021-01-19 02:53:02 +11:00
Ines Montani
a203e3dbb8 Support spacy-legacy via the registry 2021-01-15 21:42:40 +11:00
Ines Montani
b0b743597c Tidy up and auto-format 2021-01-15 11:57:36 +11:00
Sofie Van Landeghem
75d9019343
Fix types of Tok2Vec encoding architectures (#6442)
* fix TorchBiLSTMEncoder documentation

* ensure the types of the encoding Tok2vec layers are correct

* update references from v1 to v2 for the new architectures
2021-01-07 16:39:27 +11:00
Sofie Van Landeghem
3983bc6b1e
Fix Transformer width in TextCatEnsemble (#6431)
* add convenience method to determine tok2vec width in a model

* fix transformer tok2vec dimensions in TextCatEnsemble architecture

* init function should not be nested to avoid pickle issues
2021-01-06 12:44:04 +01:00
Ines Montani
991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Sofie Van Landeghem
282a3b49ea
Fix parser resizing when there is no upper layer (#6460)
* allow resizing of the parser model even when upper=False

* update from spacy.TransitionBasedParser.v1 to v2

* bugfix
2020-12-18 18:56:57 +08:00
Sofie Van Landeghem
cfc72c2995
Bugfix multi-label textcat reproducibility (#6481)
* add test for multi-label textcat reproducibility

* remove positive_label

* fix lengths dtype

* fix comments

* remove comment that we should not have forgotten :-)
2020-12-09 06:29:15 +08:00
Sofie Van Landeghem
de108ed3e8
Add specific error when StaticVectors can't read the vectors data (#6450) 2020-12-09 06:16:07 +08:00
Sofie Van Landeghem
f98a04434a
pretrain architectures (#6451)
* define new architectures for the pretraining objective

* add loss function as attr of the omdel

* cleanup

* cleanup

* shorten name

* fix typo

* remove unused error
2020-12-08 14:41:03 +08:00
Sofie Van Landeghem
a0c899a0ff
Fix textcat + transformer architecture (#6371)
* add pooling to textcat TransformerListener

* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Sofie Van Landeghem
75a202ce65
TextCat updates and fixes (#6263)
* small fix in example imports

* throw error when train_corpus or dev_corpus is not a string

* small fix in custom logger example

* limit macro_auc to labels with 2 annotations

* fix typo

* also create parents of output_dir if need be

* update documentation of textcat scores

* refactor TextCatEnsemble

* fix tests for new AUC definition

* bump to 3.0.0a42

* update docs

* rename to spacy.TextCatEnsemble.v2

* spacy.TextCatEnsemble.v1 in legacy

* cleanup

* small fix

* update to 3.0.0rc2

* fix import that got lost in merge

* cursed IDE

* fix two typos
2020-10-18 14:50:41 +02:00
Sofie Van Landeghem
f8a1c1afd6
avoid dropout at runtime (#6247) 2020-10-13 14:39:59 +02:00
svlandeg
40276fd3be update NEL docs after latest refactor 2020-10-12 11:41:27 +02:00
svlandeg
08cb085f6c Merge remote-tracking branch 'upstream/develop' into fix/various 2020-10-09 17:01:27 +02:00
svlandeg
040c7c0541 fix get_dim calls in build_simple_cnn_text_classifier 2020-10-09 15:40:58 +02:00
svlandeg
853edace37 fix MultiHashEmbed example in documentation 2020-10-09 14:11:06 +02:00
Adriane Boyd
39aabf50ab Also rename to include_static_vectors in CharEmbed 2020-10-09 11:54:48 +02:00
Matthew Honnibal
cfb9770a94
Fix empty input into StaticVectors layer (#6211)
* Add test for empty doc(s)

* Fix empty check in staticvectors

* Remove xfail

* Update spacy/ml/staticvectors.py
2020-10-06 14:15:41 +02:00
Ines Montani
1a554bdcb1 Update docs and docstring [ci skip] 2020-10-05 21:55:27 +02:00
Ines Montani
9614e53b02 Tidy up and auto-format 2020-10-05 21:55:18 +02:00
Matthew Honnibal
e50047f1c5 Check lengths match 2020-10-05 20:02:45 +02:00
Matthew Honnibal
cdd2b79b6d Remove deprecated MultiHashEmbed 2020-10-05 19:58:18 +02:00
Matthew Honnibal
6dcc4a0ba6 Simplify MultiHashEmbed signature 2020-10-05 19:57:45 +02:00
Matthew Honnibal
eb9ba61517 Format 2020-10-05 15:29:49 +02:00
Matthew Honnibal
8ec79ad3fa Allow configuration of MultiHashEmbed features
Update arguments to MultiHashEmbed layer so that the attributes can be
controlled. A kind of tricky scheme is used to allow optional
specification of the rows. I think it's an okay balance between
flexibility and convenience.
2020-10-05 15:22:00 +02:00
Ines Montani
bcd52e5486 Tidy up errors and warnings 2020-10-04 11:16:31 +02:00
Ines Montani
3bc3c05fcc Tidy up and auto-format 2020-10-03 17:20:18 +02:00
svlandeg
02247cccaf Merge remote-tracking branch 'upstream/develop' into feature/small-fixes 2020-10-02 20:48:11 +02:00
Matthew Honnibal
6965cdf16d Fix comment 2020-10-02 17:26:21 +02:00
Ines Montani
af282ae732 Fix import 2020-10-02 01:12:34 +02:00
Ines Montani
e59ecb12c0 Auto-format 2020-10-02 01:12:30 +02:00
Matthew Honnibal
75a1569908 Merge 2020-10-01 23:07:53 +02:00
Matthew Honnibal
300e5a9928
Avoid relying on NORM in default v3 models (#6176)
* Allow CharacterEmbed to specify feature

* Default to LOWER in character embed

* Update tok2vec

* Use LOWER, not NORM
2020-10-01 23:05:55 +02:00
Matthew Honnibal
b854bca15c Default to LOWER in character embed 2020-10-01 22:17:58 +02:00
Matthew Honnibal
684a77870b Allow CharacterEmbed to specify feature 2020-10-01 22:17:26 +02:00
Sofie Van Landeghem
a22215f427
Add FeatureExtractor from Thinc (#6170)
* move featureextractor from Thinc

* Update website/docs/api/architectures.md

Co-authored-by: Ines Montani <ines@ines.io>

* Update website/docs/api/architectures.md

Co-authored-by: Ines Montani <ines@ines.io>

Co-authored-by: Ines Montani <ines@ines.io>
2020-10-01 16:22:48 +02:00
svlandeg
5121972930 add types of Tok2Vec embedding layers 2020-10-01 09:20:09 +02:00
svlandeg
5a9fdbc8ad state_type as Literal 2020-09-23 17:32:14 +02:00
svlandeg
25b34bba94 throw custom error when state_type is invalid 2020-09-23 16:57:14 +02:00
svlandeg
dd2292793f 'parser' instead of 'deps' for state_type 2020-09-23 16:53:49 +02:00
svlandeg
6c85fab316 state_type and extra_state_tokens instead of nr_feature_tokens 2020-09-23 13:35:09 +02:00
Ines Montani
1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Adriane Boyd
f3db3f6fe0
Add vectors option to CharacterEmbed (#6069)
* Add vectors option to CharacterEmbed

* Update spacy/pipeline/morphologizer.pyx

* Adjust default morphologizer config

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Ines Montani
1955aaaa20
Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip] 2020-09-09 21:46:40 +02:00
Sofie Van Landeghem
cb66ea7400
Remove simple_ner code (#6041)
* remove simple_ner code

* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
svlandeg
39aa740777 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-09 11:59:34 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
svlandeg
bd8f9b188b small fixes 2020-09-08 17:24:36 +02:00
svlandeg
06ef66fd73 Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs 2020-09-08 10:28:42 +02:00
svlandeg
c32fcdf4c9 fix typo 2020-09-04 09:10:21 +02:00
Ines Montani
5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Ines Montani
091a9b522a Remove unused variable [ci skip] 2020-08-29 13:11:26 +02:00
Matthew Honnibal
160a855246 Format 2020-08-23 21:15:12 +02:00
Sofie Van Landeghem
358cbb21e3
Define candidate generator in EL config (#5876)
* candidate generator as separate part of EL config

* update comment

* ent instead of str as input for candidate generation

* Span instead of str: correct type indication

* fix types

* unit test to create new candidate generator

* fix replace_pipe argument passing

* move error message, general cleanup

* add vocab back to KB constructor

* provide KB as callable from Vocab arg

* rename to kb_loader, fix KB serialization as part of the EL pipe

* fix typo

* reformatting

* cleanup

* fix comment

* fix wrongly duplicated code from merge conflict

* rename dump to to_disk

* from_disk instead of load_bulk

* update test after recent removal of set_morphology in tagger

* remove old doc
2020-08-18 16:10:36 +02:00
Ines Montani
3a193eb8f1 Fix imports, types and default configs 2020-08-07 18:40:54 +02:00
Matthew Honnibal
b1d83fc13e Fix imports 2020-08-07 16:55:54 +02:00
Matthew Honnibal
473504d837 Format 2020-08-07 16:49:00 +02:00
Matthew Honnibal
234c52a91e Add tok2vec docstrings 2020-08-07 16:48:48 +02:00
Matthew Honnibal
547bc8a82b Add docstring notes 2020-08-07 16:17:34 +02:00
Matthew Honnibal
da6e59519e Add docstrings for simple_ner 2020-08-07 15:09:49 +02:00
Matthew Honnibal
7ef8a64df9 Add docstring for parser 2020-08-07 14:59:34 +02:00
Ines Montani
e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Ines Montani
e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
Matthew Honnibal
142b58be92 Fix import 2020-07-29 14:45:09 +02:00
Matthew Honnibal
c99a653070 Adjust textcat model 2020-07-29 14:38:15 +02:00
Matthew Honnibal
9e1b11dd81 Update vectors in textcat 2020-07-29 14:35:36 +02:00
Matthew Honnibal
07b47eaac8 Update tok2vec layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
5ae8628571 Fix CharacterEmbed layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
00de30bcc2 Update CharacterEmbed function 2020-07-29 14:01:12 +02:00
Matthew Honnibal
c35d6282fc Add previous HashEmbedCNN tok2vec to make transition easier 2020-07-29 14:01:12 +02:00
Matthew Honnibal
0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal
475d7c1c7c Fix StaticVectors class 2020-07-29 14:00:11 +02:00
Matthew Honnibal
44d350dc94 Use spaCy's StaticVectors 2020-07-29 14:00:11 +02:00
Matthew Honnibal
099e9331c5 Fix tok2vec 2020-07-29 14:00:10 +02:00
Matthew Honnibal
fe0cdcd461 Fixes 2020-07-29 14:00:09 +02:00
Matthew Honnibal
123f8b832d Refactor Tok2Vec model 2020-07-29 14:00:09 +02:00
Matthew Honnibal
c6b4f63c7c Remove obsolete function 2020-07-29 14:00:09 +02:00
Matthew Honnibal
9cc7262224 Draft StaticVectors layer 2020-07-29 14:00:09 +02:00
Matthew Honnibal
cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani
ed61fb10fc Rename default textcat arch to TextCatEnsemble 2020-07-26 15:11:43 +02:00
Ines Montani
e92df281ce Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
Ines Montani
43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
Sofie Van Landeghem
6f3bb6f77c
fix doc.to_utf8 on GPU (#5757) 2020-07-13 23:05:33 +02:00
Ines Montani
5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Sofie Van Landeghem
c1ea55307b
Fixing reproducible training (#5735)
* Add initial reproducibility tests

* failing test for default_text_classifier (WIP)

* track trouble to underlying tok2vec layer

* add regression test for Issue 5551

* tests go green with https://github.com/explosion/thinc/pull/359

* update test

* adding fixed seeds to HashEmbed layers, seems to fix the reproducility issue

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-09 19:39:31 +02:00
Matthw Honnibal
433dc3c9c9 Simplify PrecomputableAffine slightly 2020-07-07 17:22:47 +02:00
Matthw Honnibal
d1fd3438c3 Add dropout to parser hidden layer 2020-07-07 01:38:15 +02:00
Matthw Honnibal
709fc5e4ad Clarify dropout and seed in Tok2Vec 2020-07-06 17:50:21 +02:00
Matthw Honnibal
3f6f087113 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-04 23:52:12 +02:00
Matthw Honnibal
8870a6ded7 Specify seeds in HashEmbed 2020-07-04 23:51:49 +02:00
Ines Montani
37c3bb35e2 Auto-format 2020-07-04 16:25:34 +02:00