Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training ( #5865 )
...
* Move batchers into their own module (and registry)
* Update CLI
* Update Corpus and batcher
* Update tests
* Update one config
* Merge 'evaluation' block back under [training]
* Import batchers in gold __init__
* Fix batchers
* Update config
* Update schema
* Update util
* Don't assume train and dev are actually paths
* Update onto-joint config
* Fix missing import
* Format
* Format
* Update spacy/gold/corpus.py
Co-authored-by: Ines Montani <ines@ines.io>
* Fix name
* Update default config
* Fix get_length option in batchers
* Update test
* Add comment
* Pass path into Corpus
* Update docstring
* Update schema and configs
* Update config
* Fix test
* Fix paths
* Fix print
* Fix create_train_batches
* [training.read_train] -> [training.train_corpus]
* Update onto-joint config
Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component ( #5872 )
...
* EL field documentation
* documentation consistent with docs
* default empty KB, initialize vocab separately
* formatting
* add test for changing the default entity vector length
* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
b7e3018d97
Recalculate alignment if tokenization differs ( #5868 )
...
* Recalculate alignment if tokenization differs
* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug
2020-08-03 13:09:20 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config ( #5854 )
...
* Add init CLI and init config draft
* Improve config validation
* Auto-format
* Don't export anything in debug config
* Update docs
2020-08-02 15:18:30 +02:00
svlandeg
6f4e46ee93
Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
...
# Conflicts:
# pyproject.toml
# requirements.txt
# setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani
b40f44419b
Simplify pipe analysis
...
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
b68c53858c
Remove global
2020-07-31 18:37:58 +02:00
Ines Montani
30a76fcf6f
Integrate and simplify pipe analysis
2020-07-31 18:34:35 +02:00
svlandeg
9b719dfb1a
use divider in between steps
2020-07-31 18:06:48 +02:00
svlandeg
51ffc4a166
rename pipe_name to component
2020-07-31 17:58:55 +02:00
svlandeg
878327d38e
set printing of final predictions to False by default
2020-07-31 17:36:32 +02:00
Ines Montani
2d955fbf98
Fix linting [ci skip]
2020-07-31 17:05:28 +02:00
Ines Montani
e9e8fa2466
Update docs and types
2020-07-31 17:02:54 +02:00
svlandeg
cc2f58a1b0
use data_validation context manager
2020-07-31 16:49:42 +02:00
svlandeg
5fa3235d06
set DATA_VALIDATION to False for debug_model (upgrade thinc)
2020-07-31 15:21:01 +02:00
svlandeg
08d3c36c20
bugfix in train CLI
2020-07-31 15:03:43 +02:00
Adriane Boyd
9b509aa87f
Move Language.evaluate scorer config to new arg
...
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd
901801b33b
Fix default arguments in DependencyParser.score
2020-07-31 10:55:44 +02:00
Adriane Boyd
9d79916792
Merge branch 'develop' into feature/scorer-adjustments
2020-07-31 10:48:14 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) ( #5844 )
...
* moving syntax folder to _parser_internals
* moving nn_parser and transition_system
* move nn_parser and transition_system out of internals folder
* moving nn_parser code into transition_system file
* rename transition_system to transition_parser
* moving parser_model and _state to ml
* move _state back to internals
* The Parser now inherits from Pipe!
* small code fixes
* removing unnecessary imports
* remove link_vectors_to_models
* transition_system to internals folder
* little bit more cleanup
* newlines
2020-07-30 23:30:54 +02:00
svlandeg
0b23594953
pipe_name instead of section in debug_model
2020-07-30 20:06:28 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors
2020-07-29 18:49:26 +02:00
Ines Montani
b0f57a0cac
Update docs and consistency
2020-07-29 15:14:07 +02:00
Matthew Honnibal
a2d573c039
Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors
2020-07-29 14:56:27 +02:00
Matthew Honnibal
2af741d7e3
Fix train arg
2020-07-29 14:56:01 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors
2020-07-29 14:54:10 +02:00
Ines Montani
62266fb828
Fix broken type annotation
2020-07-29 14:49:49 +02:00
Matthew Honnibal
142b58be92
Fix import
2020-07-29 14:45:09 +02:00
Matthew Honnibal
c99a653070
Adjust textcat model
2020-07-29 14:38:15 +02:00
Matthew Honnibal
9e1b11dd81
Update vectors in textcat
2020-07-29 14:35:36 +02:00
Matthew Honnibal
105cf29967
Fix DocBin
2020-07-29 14:23:13 +02:00
Ines Montani
ff0bc05da8
Fix docstrings [ci skip]
2020-07-29 14:09:37 +02:00
Ines Montani
6e2623d3f8
Fix docstring [ci skip]
2020-07-29 14:08:05 +02:00
Ines Montani
8d56260d92
Fix docstrings [ci skip]
2020-07-29 14:07:13 +02:00
Ines Montani
80b18124d2
Fix docstring [ci skip]
2020-07-29 14:03:35 +02:00
Matthew Honnibal
f0cf4a2dca
Update tests
2020-07-29 14:01:14 +02:00
Matthew Honnibal
07b47eaac8
Update tok2vec layer
2020-07-29 14:01:13 +02:00
Matthew Honnibal
5ae8628571
Fix CharacterEmbed layer
2020-07-29 14:01:13 +02:00
Matthew Honnibal
97d3651574
Fix stray link_vectors_to_models call
2020-07-29 14:01:13 +02:00
Matthew Honnibal
c7d1ece3eb
Update tests
2020-07-29 14:01:13 +02:00
Matthew Honnibal
00de30bcc2
Update CharacterEmbed function
2020-07-29 14:01:12 +02:00
Matthew Honnibal
6a6b09bd32
Update morphologizer model
2020-07-29 14:01:12 +02:00
Matthew Honnibal
20e9098e3f
Update tests
2020-07-29 14:01:12 +02:00
Matthew Honnibal
c35d6282fc
Add previous HashEmbedCNN tok2vec to make transition easier
2020-07-29 14:01:12 +02:00
Matthew Honnibal
1784c95827
Clean up link_vectors_to_models unused stuff
2020-07-29 14:01:11 +02:00
Matthew Honnibal
0c17ea4c85
Format
2020-07-29 14:00:13 +02:00
Matthew Honnibal
2aff3c4b5a
Load vectors in 'spacy train'
2020-07-29 14:00:13 +02:00
Matthew Honnibal
7852a68a75
Fix load_vectors_into_model function
2020-07-29 14:00:13 +02:00
Matthew Honnibal
7299419fe4
Don't load vectors in Language.from_config
2020-07-29 14:00:12 +02:00
Matthew Honnibal
30dd96c540
Load vectors in Language.from_config
2020-07-29 14:00:12 +02:00
Matthew Honnibal
df95e2af64
Add load_vectors_into_model util
2020-07-29 14:00:12 +02:00
Matthew Honnibal
475d7c1c7c
Fix StaticVectors class
2020-07-29 14:00:11 +02:00
Matthew Honnibal
44d350dc94
Use spaCy's StaticVectors
2020-07-29 14:00:11 +02:00
Matthew Honnibal
acc64e138a
Add import
2020-07-29 14:00:11 +02:00
Matthew Honnibal
9987ea9e4d
Fix Tok2Vec begin_training
2020-07-29 14:00:10 +02:00
Matthew Honnibal
099e9331c5
Fix tok2vec
2020-07-29 14:00:10 +02:00
Matthew Honnibal
fe0cdcd461
Fixes
2020-07-29 14:00:09 +02:00
Matthew Honnibal
123f8b832d
Refactor Tok2Vec model
2020-07-29 14:00:09 +02:00
Matthew Honnibal
c6b4f63c7c
Remove obsolete function
2020-07-29 14:00:09 +02:00
Matthew Honnibal
9cc7262224
Draft StaticVectors layer
2020-07-29 14:00:09 +02:00
Matthew Honnibal
cb9654e98c
WIP on new StaticVectors
2020-07-29 14:00:09 +02:00
Ines Montani
e257e66ab9
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-07-29 11:36:45 +02:00
Ines Montani
e0ffe36e79
Update docstrings, docs and types
2020-07-29 11:36:42 +02:00
Sofie Van Landeghem
40c995b1be
Option for returning only greedy matches ( #5771 )
...
* add "greedy" option for match pattern
* distinction between greedy FIRST or LONGEST
* check for proper values, throw custom warning otherwise
* unxfail one more test
* add comment in docstring
* add test that LONGEST also prefers first match if equal length
* use c arrays for more efficient processing
* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
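The FIRST/LONGEST distinction in the commit above can be illustrated with a small standalone sketch. It is not the matcher's actual implementation (which, per the commit, uses C arrays); `filter_greedy` and the `(start, end)` tuple format are assumptions made for illustration, and it raises a `ValueError` for bad values where the commit mentions a custom warning.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_greedy(matches: List[Span], greedy: str) -> List[Span]:
    """Hypothetical helper: keep only non-overlapping matches of one pattern.

    FIRST prefers the earliest start (longer match on ties);
    LONGEST prefers the longest match (earlier start on ties).
    """
    if greedy == "FIRST":
        ordered = sorted(matches, key=lambda m: (m[0], -(m[1] - m[0])))
    elif greedy == "LONGEST":
        ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    else:
        # Assumption: the real code reports invalid values differently.
        raise ValueError(f"greedy must be 'FIRST' or 'LONGEST', got {greedy!r}")
    taken = set()  # token positions already covered by a kept match
    kept = []
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


candidates = [(0, 2), (0, 3), (1, 5), (4, 5)]
first = filter_greedy(candidates, "FIRST")
longest = filter_greedy(candidates, "LONGEST")
```

With these candidates, FIRST keeps the match starting at token 0 and then the remaining non-overlapping one, while LONGEST keeps only the four-token match.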
Adriane Boyd
191a12d75f
Fix score_weights typo in train CLI ( #5835 )
2020-07-29 11:04:12 +02:00
Adriane Boyd
0cddb0dbe9
Move timing into Language.evaluate ( #5836 )
...
Move timing into `Language.evaluate` so that only the processing is
timed, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
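A minimal sketch of the timing change described in the commit above, assuming the clock is stopped before scoring begins. `evaluate_with_speed`, `process` and `score` are hypothetical stand-ins, not the signature of `Language.evaluate`.

```python
import time
from typing import Any, Callable, Dict, Iterable, List, Sequence


def evaluate_with_speed(
    texts: Iterable[str],
    process: Callable[[str], Sequence[Any]],
    score: Callable[[List[Sequence[Any]]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Time only the processing step, then score, then report words/second."""
    start = time.perf_counter()
    docs = [process(text) for text in texts]
    elapsed = time.perf_counter() - start  # scoring is deliberately excluded
    n_words = sum(len(doc) for doc in docs)
    scores = score(docs)
    scores["speed"] = n_words / elapsed if elapsed > 0 else 0.0
    return scores


# Usage with toy stand-ins: whitespace "tokenization" and a trivial scorer.
scores = evaluate_with_speed(
    ["a short text", "another one"],
    process=str.split,
    score=lambda docs: {"n_docs": len(docs)},
)
```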
Adriane Boyd
c689ae8f0a
Fix types in Scorer
2020-07-29 10:40:30 +02:00
Ines Montani
7adffc5361
Remove unused schema
2020-07-28 23:12:47 +02:00
Ines Montani
e5d9eaf79c
Tidy up docstrings and arguments
2020-07-28 23:12:42 +02:00
Ines Montani
ac24adec73
Small adjustments to Scorer and docs
2020-07-28 21:39:42 +02:00
Ines Montani
2c7a32cf12
Remove unused methods
2020-07-28 16:50:02 +02:00
Ines Montani
ba22111ff4
Move error to Errors
2020-07-28 16:24:14 +02:00
Ines Montani
2748249217
Re-add meta["pipeline"] for now
2020-07-28 16:14:23 +02:00
Ines Montani
b83ead5bf5
Merge pull request #5824 from svlandeg/fix/textcat-v3
2020-07-28 15:04:25 +02:00
Ines Montani
06a97a8766
Support --opt=value format in CLI config overrides
2020-07-28 13:43:15 +02:00
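As a rough illustration of accepting both `--opt value` and `--opt=value` for overrides, as in the commit above, the sketch below parses a list of extra arguments into a flat dict. `parse_overrides` and the option names are hypothetical, and values are left as strings; the real CLI may parse and validate them differently.

```python
from typing import Dict, List


def parse_overrides(args: List[str]) -> Dict[str, str]:
    """Hypothetical parser for `--section.key value` and `--section.key=value`."""
    remaining = list(args)
    overrides: Dict[str, str] = {}
    while remaining:
        opt = remaining.pop(0)
        if not opt.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {opt!r}")
        opt = opt[2:]
        if "=" in opt:
            # --opt=value: split on the first "=" only
            name, value = opt.split("=", 1)
        else:
            # --opt value: the value is the next argument
            if not remaining or remaining[0].startswith("--"):
                raise ValueError(f"Missing value for option --{opt}")
            name, value = opt, remaining.pop(0)
        overrides[name] = value
    return overrides


parsed = parse_overrides(["--training.dropout=0.2", "--training.max_epochs", "5"])
```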
Ines Montani
ae4d8a6ffd
Update docstrings, docs and pipe consistency
2020-07-28 13:37:31 +02:00
Ines Montani
0094cb0d04
Remove scores list from config and document
2020-07-28 11:22:24 +02:00
Ines Montani
894e20c466
Merge branch 'develop' into feature/component-scores
2020-07-27 18:14:39 +02:00
Ines Montani
d8b519c23c
API docs, docstrings and argument consistency
2020-07-27 18:11:45 +02:00
svlandeg
85b2dcfd67
cleanup
2020-07-27 17:54:44 +02:00
svlandeg
61068e0fb1
util function dot_to_object and corresponding unit test
2020-07-27 17:50:12 +02:00
Ines Montani
10b84e1e27
Add flag to toggle sdist creation on package [ci skip]
2020-07-27 16:52:23 +02:00
Adriane Boyd
34c92dfe63
Add missing Scorer imports
2020-07-27 15:08:51 +02:00
Adriane Boyd
8bb0507777
Add and update score methods and score weights
...
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.
* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
2020-07-27 14:44:53 +02:00
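The role of the weights described in the commit above can be sketched as follows: every key in the weights is shown as a column, but a key with weight `0.0` contributes nothing to the overall score. The function names and the score keys are assumptions for illustration, not the code used by `spacy train`.

```python
from typing import Dict, List, Optional


def displayed_columns(weights: Dict[str, float]) -> List[str]:
    """All weighted keys are shown, including those with weight 0.0."""
    return list(weights.keys())


def overall_score(scores: Dict[str, Optional[float]], weights: Dict[str, float]) -> float:
    """Weighted sum of scores; missing or None scores count as 0.0 (assumption)."""
    return sum((scores.get(key) or 0.0) * weight for key, weight in weights.items())


# Hypothetical keys and values, chosen only to show the arithmetic.
weights = {"tag_acc": 0.5, "dep_uas": 0.0, "dep_las": 0.5}
scores = {"tag_acc": 0.9, "dep_uas": 0.8, "dep_las": 0.7}
columns = displayed_columns(weights)
final = overall_score(scores, weights)
```

Here `dep_uas` appears among the displayed columns even though it does not affect `final`.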
Adriane Boyd
baf19fd652
Update cats scoring to provide overall score
...
* Provide top-level score as `attr_score`
* Provide a description of the score as `attr_score_desc`
* Provide all potential scores keys, setting unused keys to `None`
* Update CLI evaluate accordingly
2020-07-27 12:26:10 +02:00
Adriane Boyd
f8cf378be9
Combine weights from multiple components
...
Combine weights from multiple components for the same score.
2020-07-27 10:21:31 +02:00
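The commit above does not state how weights for the same score are combined. One plausible reading, shown here purely as an assumption, is to sum the weights per key across components and then normalize so that they total 1.0; `combine_weights` is a hypothetical name.

```python
from typing import Dict, List


def combine_weights(component_weights: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum weights per score key across components, then normalize to 1.0.

    Assumption: summing and normalizing is only one possible combination rule.
    """
    totals: Dict[str, float] = {}
    for weights in component_weights:
        for key, value in weights.items():
            totals[key] = totals.get(key, 0.0) + value
    total = sum(totals.values())
    if total == 0:
        return totals  # nothing to normalize
    return {key: value / total for key, value in totals.items()}


combined = combine_weights(
    [
        {"tag_acc": 1.0},
        {"dep_uas": 0.0, "dep_las": 1.0},
        {"tag_acc": 1.0},  # a second component reporting the same score
    ]
)
```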
Ines Montani
3d56a3f286
Make more args keyword-only
2020-07-27 00:27:53 +02:00
Matthew Honnibal
80271ac0ba
Update default config
2020-07-26 15:27:39 +02:00
Ines Montani
ed61fb10fc
Rename default textcat arch to TextCatEnsemble
2020-07-26 15:11:43 +02:00
Ines Montani
53d37da29a
Make sure @factories is removed from config
2020-07-26 15:11:24 +02:00
Ines Montani
4060c2d5a6
Fix test
2020-07-26 13:40:19 +02:00
Ines Montani
2470486543
Allow pipeline components to set default scores and weights
2020-07-26 13:18:43 +02:00
Ines Montani
787d066e22
Remove pipes.pyx
...
Probably accidentally re-added in a merge?
2020-07-26 13:08:52 +02:00
Matthew Honnibal
520d25cb50
Add smart_open dependency to fetch project assets ( #5812 )
...
* Use smart_open for project assets
* Fix assets.py
* Update pyproject.toml
2020-07-26 12:15:00 +02:00
Ines Montani
e92df281ce
Tidy up, autoformat, add types
2020-07-25 15:01:15 +02:00
Matthew Honnibal
71242327b2
Set version to v3.0.0a5
2020-07-25 14:06:01 +02:00
Ines Montani
cdbd6ba912
Merge pull request #5798 from explosion/feature/language-data-config
2020-07-25 13:34:49 +02:00
Ines Montani
49f27a2a7b
Tidy up [ci skip]
2020-07-25 13:00:49 +02:00
Ines Montani
4a0a692875
Add missing lex_attr_getters ( resolves #5806 )
2020-07-25 12:55:18 +02:00