Commit Graph

992 Commits

Author SHA1 Message Date
Ines Montani
5497acf49a Support config overrides via environment variables 2020-09-21 11:25:10 +02:00
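As a rough illustration of the idea behind 5497acf49a, overrides could be read from an environment variable and parsed like command-line arguments. This is a minimal sketch, not the code from the commit: the variable name `SPACY_CONFIG_OVERRIDES`, the helpers `parse_overrides` and `overrides_from_env`, and the `--section.key value` format are all assumptions made for illustration.

```python
import json
import os
import shlex
from typing import Any, Dict, List

# Assumed name of the environment variable (hypothetical for this sketch).
ENV_VAR = "SPACY_CONFIG_OVERRIDES"


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Turn ["--training.max_epochs", "5"] into {"training.max_epochs": 5}."""
    overrides: Dict[str, Any] = {}
    it = iter(args)
    for name in it:
        if not name.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got: {name}")
        try:
            raw = next(it)
        except StopIteration:
            raise ValueError(f"Missing value for override: {name}") from None
        try:
            # Interpret numbers, booleans and lists as JSON where possible.
            overrides[name[2:]] = json.loads(raw)
        except json.JSONDecodeError:
            # Anything that is not valid JSON is kept as a plain string.
            overrides[name[2:]] = raw
    return overrides


def overrides_from_env(env_var: str = ENV_VAR) -> Dict[str, Any]:
    """Read overrides from the environment; an unset variable yields {}."""
    return parse_overrides(shlex.split(os.environ.get(env_var, "")))
```

With the variable set to `--training.max_epochs 5 --nlp.lang en`, the sketch would produce a dictionary keyed by dotted config paths, which could then be merged with overrides given on the command line.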
Ines Montani
1114219ae3 Tidy up and auto-format 2020-09-21 10:59:07 +02:00
Ines Montani
b2302c0a1c Improve error for missing dependency 2020-09-20 17:44:51 +02:00
Matthew Honnibal
8fb59d958c Format 2020-09-20 16:31:48 +02:00
Matthew Honnibal
dc22771f87 Fix sparse checkout 2020-09-20 16:30:05 +02:00
Matthew Honnibal
a0fb5e50db Use simple git clone call if not sparse 2020-09-20 16:22:04 +02:00
Matthew Honnibal
2c24d633d0 Use updated run_command 2020-09-20 16:21:43 +02:00
Ines Montani
554c9a2497 Update docs [ci skip] 2020-09-20 12:30:53 +02:00
svlandeg
6db1d5dc0d trying some stuff 2020-09-19 19:11:30 +02:00
Ines Montani
e863b3dc14
Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2 2020-09-19 12:33:38 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator (#6091)
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'

* --code instead of --code-path

* update documentation

* avoid querying the "system" section directly

* add explanation of gpu_allocator to TF/PyTorch section in docs

* fix typo

* fix typo 2

* use set_gpu_allocator from thinc 8.0.0a34

* default null instead of empty string
2020-09-19 01:17:02 +02:00
svlandeg
73ff52b9ec hack for tok2vec listener 2020-09-18 16:43:15 +02:00
Adriane Boyd
eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
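The "strict" loading described in eed4b785f5 can be pictured as follows. This is only a sketch under stated assumptions: `AVAILABLE_TABLES` is made-up stand-in data and `load_lookups_strict` is a hypothetical helper, not the function registered as `spacy.LoadLookupsData.v1`.

```python
from typing import Dict, List

# Stand-in for the tables a lookups package might ship (hypothetical data).
AVAILABLE_TABLES: Dict[str, Dict[str, Dict[str, str]]] = {
    "en": {
        "lexeme_norm": {"cos": "because"},
        "lemma_lookup": {"mice": "mouse"},
    },
}


def load_lookups_strict(lang: str, tables: List[str]) -> Dict[str, Dict[str, str]]:
    """Return exactly the requested tables and fail if any is unavailable."""
    lang_tables = AVAILABLE_TABLES.get(lang, {})
    missing = [name for name in tables if name not in lang_tables]
    if missing:
        # Strict mode: never silently skip a table that the config asked for.
        raise ValueError(f"Tables not available for '{lang}': {missing}")
    # Only the listed tables are returned, so large tables stay opt-in.
    return {name: lang_tables[name] for name in tables}
```

Because only the listed tables are returned, a config that names just `lexeme_norm` never pulls in the larger tables by accident.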
Ines Montani
a127fa475e
Merge pull request #6078 from svlandeg/fix/corpus 2020-09-18 14:44:21 +02:00
svlandeg
e4fc7e0222 fixing output sample to proper 2D array 2020-09-17 22:34:36 +02:00
Ines Montani
3865214343 Use consistent shortcut 2020-09-17 16:57:02 +02:00
svlandeg
35a3931064 fix typo 2020-09-17 16:36:27 +02:00
svlandeg
ddfc1fc146 add pretraining option to init config 2020-09-17 16:05:40 +02:00
svlandeg
427dbecdd6 cleanup and formatting 2020-09-17 11:48:04 +02:00
svlandeg
0c35885751 generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
svlandeg
51fa929f47 rewrite train_corpus to corpus.train in config 2020-09-15 21:58:04 +02:00
Ines Montani
9cc304c194
Merge pull request #6064 from explosion/fix/sparse-checkout-ux
Fix sparse checkout and error handling
2020-09-15 00:32:20 +02:00
Sofie Van Landeghem
3216a33149
positive_label config for textcat (#6062)
* hook up positive_label in textcat

* unit tests

* documentation

* formatting

* tests

* fix typo

* move verify_config to after begin_training

* revert accidental commit
2020-09-14 17:08:00 +02:00
Ines Montani
c052017025 Fix sparse checkout and error handling 2020-09-14 14:12:58 +02:00
Matthew Honnibal
54c40223a1
Improve v3 pretrain command (#6040)
* Starts to run

* Update pretrain script

* Update corpus

* Update pretrain schema

* Remove outdated test

* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00
Ines Montani
febb99916d Tidy up and auto-format [ci skip] 2020-09-13 10:55:36 +02:00
Ines Montani
a5633b205f Fix handling of errors around git [ci skip] 2020-09-13 10:52:28 +02:00
Ines Montani
f8846c198d Update types and docstrings 2020-09-13 10:52:02 +02:00
Matthew Honnibal
37347830d4 Fix reading in GloVe vectors 2020-09-12 17:31:18 +02:00
Ines Montani
b41be87213
Merge pull request #6051 from svlandeg/feature/cli-config 2020-09-12 17:12:35 +02:00
Ines Montani
eedaaaec75 Fix handling of existing asset without checksum [ci skip] 2020-09-12 17:02:53 +02:00
svlandeg
a75cfe0da6 Merge remote-tracking branch 'upstream/develop' into feature/cli-config 2020-09-12 14:44:40 +02:00
svlandeg
115147804a string_to_list to parse comma-separated string into a list 2020-09-12 14:43:22 +02:00
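A helper of the kind named in 115147804a might look like the sketch below. It is written from the commit message alone, so the exact behaviour (stripping brackets and quotes, dropping empty items) is an assumption and the real function may differ.

```python
from typing import List


def string_to_list(value: str) -> List[str]:
    """Parse "a, b" or "['a', 'b']" into ["a", "b"]."""
    value = value.strip()
    # Accept a list written inside a string, e.g. "[tagger,parser]".
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    result = []
    for part in value.split(","):
        # Remove surrounding whitespace and optional quotes around each item.
        part = part.strip().strip("'\"")
        if part:
            result.append(part)
    return result
```

This lets a command-line option such as a pipeline accept either `tagger,parser` or `"['tagger', 'parser']"` and yield the same list.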
Ines Montani
f886f5bbc8
Merge pull request #6048 from explosion/fix/clone-compat 2020-09-12 10:30:49 +02:00
Ines Montani
0b2e07215d Support overwriting name on spacy package 2020-09-11 11:38:28 +02:00
svlandeg
5b94aeece9 support pipeline as "list in string" 2020-09-11 11:08:46 +02:00
Ines Montani
1bce432b4a Adjust message [ci skip] 2020-09-11 10:00:49 +02:00
Ines Montani
5acd4fbcd8 Merge branch 'develop' into fix/clone-compat 2020-09-11 09:58:30 +02:00
Ines Montani
761bd60d43 Adjust info message 2020-09-11 09:57:00 +02:00
Ines Montani
6831161bfa Resolve path to be extra sure 2020-09-11 09:56:49 +02:00
svlandeg
1723fb73c4 remove brol 2020-09-10 17:44:59 +02:00
svlandeg
08a831ce83 process trailing slash if any 2020-09-10 17:39:52 +02:00
Ines Montani
3e83a509bb WIP: fix project clone compatibility 2020-09-10 15:49:13 +02:00
svlandeg
f1bc09c1e9 restore partly 2020-09-10 14:53:02 +02:00
svlandeg
3889747119 asset fix & UX 2020-09-10 14:36:53 +02:00
svlandeg
a36766d153 hookup branch 2020-09-10 12:00:34 +02:00
svlandeg
97d99f7efa Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes 2020-09-10 11:51:34 +02:00
Ines Montani
908f3a4494 Update default projects repo [ci skip] 2020-09-10 11:42:14 +02:00
svlandeg
92f9d2f406 small UX fixes 2020-09-10 11:35:50 +02:00
svlandeg
1fc5486792 more fine-grained errors for git_sparse_checkout 2020-09-10 11:31:32 +02:00
Ines Montani
15bc3a37b4 Add --branch to project clone 2020-09-10 11:08:15 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
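The "use islice" item in 60f22e1800 suggests taking a small sample of examples for initialization without consuming the whole stream. A minimal sketch of that pattern follows; `sample_examples` and the generator are hypothetical names, not part of the Pipe API.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def sample_examples(get_examples: Callable[[], Iterable[T]], n: int = 10) -> List[T]:
    """Return at most n items from the iterable produced by get_examples."""
    # islice stops after n items, so a lazy corpus is never fully read.
    return list(islice(get_examples(), n))


def fake_corpus():
    # Stand-in for a large, lazily loaded training corpus.
    for i in range(1_000_000):
        yield f"example {i}"
```

Calling `sample_examples(fake_corpus, 3)` reads only the first three items of the generator.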
Matthew Honnibal
ba5f4c9b32 Add words and seconds to train info 2020-09-08 15:24:47 +02:00
Matthew Honnibal
b470062153
Add CLI registry (#6037) 2020-09-08 15:23:34 +02:00
Matthew Honnibal
4b7abaafdb Fix learn rate for non-transformer 2020-09-04 21:22:50 +02:00
Matthew Honnibal
465785a672 Fix project pull and push 2020-09-04 21:15:55 +02:00
Ines Montani
ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
Ines Montani
2189046869
Merge pull request #6024 from explosion/chore/registry-renaming 2020-09-04 10:54:10 +02:00
Matthew Honnibal
1c07820681 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-03 18:54:21 +02:00
Matthew Honnibal
7be8a0516a Fix project pull 2020-09-03 18:54:03 +02:00
Ines Montani
23b7d9cfa3 Prefix span getters 2020-09-03 17:37:06 +02:00
Ines Montani
c063e55eb7 Add prefix to batchers 2020-09-03 17:30:41 +02:00
Ines Montani
c53b1433b9 Adjust more arguments [ci skip] 2020-09-03 17:12:24 +02:00
Ines Montani
b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Matthew Honnibal
122cb02001 Fix averages 2020-09-02 19:37:43 +02:00
Sofie Van Landeghem
6bfb1b3a29
Fix sparse checkout for 'spacy project' (#6008)
* exit if cloning fails

* UX

* rewrite http link to git protocol, don't use stdin

* fixes to sparse checkout

* formatting
2020-09-01 19:49:01 +02:00
Ines Montani
70b226f69d Support ignore marker in project document [ci skip] 2020-09-01 12:49:04 +02:00
Ines Montani
a4c51f0f18 Add v3 info to project docs [ci skip] 2020-09-01 12:36:21 +02:00
Ines Montani
ef9005273b Update fill-config command and add silent mode [ci skip] 2020-09-01 12:07:04 +02:00
Matthew Honnibal
ec660e3131 Fix use_pytorch_for_gpu_memory 2020-09-01 00:41:38 +02:00
Matthew Honnibal
c38298b8fa Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-31 19:55:55 +02:00
Matthew Honnibal
fe298fa50a Shuffle on first epoch of train 2020-08-31 19:55:22 +02:00
svlandeg
13ee742fb4 example of custom logger 2020-08-31 14:24:41 +02:00
svlandeg
c18eb63483 Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs
# Conflicts:
#	website/docs/usage/embeddings-transformers.md
2020-08-31 13:21:36 +02:00
Sofie Van Landeghem
ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Ines Montani
45f46a5c85
Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Ines Montani
34146750d4 Use frozen list with custom errors
We don't want to break backwards compatibility too much but we also want to provide the best possible UX
2020-08-29 15:20:11 +02:00
Ines Montani
2bc31e15c9 Tidy up and auto-format [ci skip] 2020-08-29 13:01:10 +02:00
svlandeg
5230529de2 add loggers registry & logger docs sections 2020-08-28 21:44:04 +02:00
Ines Montani
4ca2698f85 Merge branch 'develop' into feature/debug-config 2020-08-28 11:19:17 +02:00
Ines Montani
d1780db6a4 Tidy up and use different error [ci skip] 2020-08-27 18:56:55 +02:00
Ines Montani
ff4175e839 Add more info to debug config 2020-08-27 18:17:58 +02:00
Ines Montani
8692d176f6
Merge pull request #5978 from explosion/feature/update-wasabi
Update wasabi: new diff_strings and MarkdownRenderer
2020-08-26 19:02:52 +02:00
Matthew Honnibal
9b22714a4e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-26 15:48:45 +02:00
Matthew Honnibal
172af24f95 Fix upload and download 2020-08-26 15:48:23 +02:00
Ines Montani
a5fff1df51 Remove outdated non-empty output dir warning [ci skip] 2020-08-26 15:45:51 +02:00
Ines Montani
3aec98ca38 Update wasabi: new diff_strings and MarkdownRenderer 2020-08-26 15:33:11 +02:00
Sofie Van Landeghem
79d460e3a2
Weights & Biases logger for train CLI (#5971)
* quick test as part of train script

* train_logger in config, default ConsoleLogger in loggers catalogue

* entitiy typo

* add wandb_logger

* cleanup

* Update spacy/cli/train_logger.py

Co-authored-by: Ines Montani <ines@ines.io>

* move loggers to gold.loggers

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-26 15:24:33 +02:00
Ines Montani
0997c30b9e
Merge pull request #5974 from explosion/feature/project-document 2020-08-26 15:14:13 +02:00
Ines Montani
627617a079 Tidy up and add docs [ci skip] 2020-08-26 13:24:55 +02:00
Ines Montani
aeebc6678d Small cleanup and adjustments 2020-08-26 10:26:57 +02:00
Ines Montani
31567d1e42 Link project.yml 2020-08-26 10:26:32 +02:00
Ines Montani
6c2a5ff53b Auto-link local sources 2020-08-26 10:26:06 +02:00
Matthew Honnibal
2771e4f2b3
Fix the git "sparse checkout" functionality (#5973)
* Fix the git sparse checkout functionality

* Format
2020-08-26 04:00:14 +02:00
Ines Montani
1c958a76c1 Add comment markers to only replace auto-generated docs 2020-08-26 00:03:06 +02:00
Ines Montani
f10989e8c4 Add "project document" and more project.yml meta fields 2020-08-25 17:14:27 +02:00
Ines Montani
fdcaf86c54 Adjust docstring
End sentence earlier so it's shown as a full sentence in --help
2020-08-25 17:13:50 +02:00
Ines Montani
b89f6fa011 Fix meta defaults and error in package command 2020-08-25 17:13:33 +02:00
Ines Montani
dd84577a98 Update CLI utils, project.yml schema and add test 2020-08-25 11:54:53 +02:00
Matthew Honnibal
8038b87f04
Various small tweaks to project CLI (#5965)
* Fix up/download of http and local paths

* Support git_sparse_checkout for assets

* Fix scorer

* Handle already-present directories for git assets

* Improve convert command

* Fix support for existing files in git assets

* Support branches in git sparse checkout

* Format

* Fix git assets

* Document git block in assets

* Fix test

* Fix test

* Revert "Fix test"

This reverts commit cf3097260f.

* Revert "Fix test"

This reverts commit 964d636e27.

* Dont multiply p/r/f by 100

* Display scores * 100 during training
2020-08-25 00:30:52 +02:00
Ines Montani
e12b03358b
Support removing extra values in fill-config (#5966)
* Support removing extra values in fill-config

* Fix test
2020-08-24 22:53:47 +02:00
Ines Montani
0e7f99da58
Fix handling of optional [pretraining] block (#5954)
* Fix handling of optional [pretraining] block

* Remove pretraining from default config

* Fix test

* Add schema option for empty pretrain block
2020-08-24 15:56:03 +02:00
Matthew Honnibal
64df37643f Update lockfile after project pull 2020-08-24 03:27:09 +02:00
Matthew Honnibal
588c28fe45 Fix project pull when deps missing 2020-08-24 01:23:36 +02:00
Matthew Honnibal
160a855246 Format 2020-08-23 21:15:12 +02:00
Matthew Honnibal
89f5b8abb3 Fix project push 2020-08-23 21:14:44 +02:00
Matthew Honnibal
3828bc3ed0 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-23 18:32:24 +02:00
Matthew Honnibal
e559867605
Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Update util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
Matthew Honnibal
fe1cf7e124 Allow score_weights to list extra scores 2020-08-23 18:31:30 +02:00
Ines Montani
9bdc9e81f5 Fix error message [ci skip] 2020-08-23 12:14:02 +02:00
Ines Montani
3826cfb8fe
Merge pull request #5930 from svlandeg/feature/init-config-fix
UX for init config
2020-08-21 12:06:33 +02:00
Ines Montani
79af7dcd6d Small wording adjustments [ci skip] 2020-08-21 12:06:19 +02:00
Matthew Honnibal
c356e62908 Minor adjustments to quickstart template 2020-08-21 00:10:21 +02:00
Ines Montani
6ad59d59fe Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip] 2020-08-20 11:20:58 +02:00
svlandeg
b96cd9fa5e fix typo 2020-08-19 18:46:08 +02:00
Ines Montani
e2f2ef3a5a Update init config and recommendations
- As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations
- Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date.
- Simplify jinja_to_js script and use fewer dependencies
2020-08-19 13:33:15 +02:00
svlandeg
a8acedd4ba example of custom reader and batcher 2020-08-18 19:15:16 +02:00
Sofie Van Landeghem
688e77562b
Train CLI script fixes (#5931)
* fix dash replacement in overrides arguments

* perform interpolation on training config

* make sure only .spacy files are read
2020-08-18 16:06:37 +02:00
Ines Montani
82f0e20318 Update docs and consistency [ci skip] 2020-08-18 14:39:40 +02:00
svlandeg
10e67b400c output_file required, spacy-transformers preferred instead of required 2020-08-18 13:38:43 +02:00
Ines Montani
990c6b4c32 Update docs and CLI [ci skip] 2020-08-17 21:38:20 +02:00
Ines Montani
3ae5e02f4f Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
Ines Montani
6ae83bde0c Fix CLI consistency [ci skip] 2020-08-16 15:46:29 +02:00
Ines Montani
45f13cbf64
Merge pull request #5916 from explosion/feature/new-thinc-config 2020-08-16 15:24:12 +02:00
Ines Montani
34bda91695 Show warnings if there's nothing to auto-fill 2020-08-16 14:19:43 +02:00
Ines Montani
dd5804d499 Update type hints 2020-08-16 14:19:33 +02:00
Ines Montani
a570c304df Update quickstart, template and docs 2020-08-15 14:50:29 +02:00
Ines Montani
fdcde9b0bf Add init fill-config 2020-08-14 16:49:26 +02:00
Ines Montani
8128e5eb35 Replace lexeme_norm warning with logging 2020-08-14 15:00:52 +02:00
Ines Montani
37814b608d Remove env_opt and simplify default Optimizer 2020-08-14 14:59:54 +02:00
Ines Montani
ab1d165bba Pass optimizer defined in config to resume/begin_training
Otherwise, this would create a default optimizer, which isn't what we want?
2020-08-14 14:59:22 +02:00
Ines Montani
67cc39af7f Update Thinc and include section order 2020-08-14 14:06:22 +02:00
Ines Montani
88b0a96801 Update for new Thinc and adjust config 2020-08-13 17:38:30 +02:00
Ines Montani
950832f087
Tidy up pipes (#5906)
* Tidy up pipes

* Fix init, defaults and raise custom errors

* Update docs

* Update docs [ci skip]

* Apply suggestions from code review

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Tidy up error handling and validation, fix consistency

* Simplify get_examples check

* Remove unused import [ci skip]

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-11 23:29:31 +02:00
Ines Montani
d5c78c7a34 Update docs and fix consistency 2020-08-09 22:31:52 +02:00
Ines Montani
1d01d89b79 Update CLI docs and evaluate command [ci skip] 2020-08-07 14:40:58 +02:00
Ines Montani
913d21f0a3
Merge pull request #5882 from explosion/feature/raise-from
Use "raise ... from" in custom errors for better tracebacks
2020-08-06 00:35:26 +02:00
Ines Montani
06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani
d92954ac1d
Merge pull request #5881 from explosion/feature/better-error-model-shortcuts 2020-08-06 00:13:35 +02:00
Ines Montani
56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
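For readers unfamiliar with the pattern named in 56c17973aa, here is a small, generic illustration of `raise ... from`. `ConfigLoadError` and `read_int` are hypothetical names for this example and do not come from the codebase.

```python
class ConfigLoadError(ValueError):
    """Hypothetical custom error, used only for this illustration."""


def read_int(settings: dict, key: str) -> int:
    """Read an integer setting, wrapping low-level failures in a custom error."""
    try:
        return int(settings[key])
    except (KeyError, ValueError) as err:
        # "from err" stores the original exception as __cause__, so the
        # traceback shows it as the direct cause of the custom error.
        raise ConfigLoadError(f"Can't read integer setting '{key}'") from err
```

Without the `from` clause, Python reports the original exception as something that happened "during handling" of another exception, which tends to read as a second, unrelated failure. With it, the traceback states that the first exception was the direct cause of the second.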
Ines Montani
5cc0d89fad
Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani
2a1fa86a0d Add better error for failed model shortcut loading 2020-08-05 23:10:29 +02:00
Ines Montani
586d695775 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-05 16:01:11 +02:00
Ines Montani
e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Matthew Honnibal
50c0e49741 Fix train CLI 2020-08-05 15:40:47 +02:00
Ines Montani
b795f02fbd
Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00