Commit Graph

7457 Commits

Author SHA1 Message Date
Matthew Honnibal
b9df4d6116 Fix textcat.begin_training if vectors set 2020-08-05 15:40:36 +02:00
Adriane Boyd
af125875cf
Update SimpleNER (#5878)
* Fix `get_loss` to use NER annotation
* Add labels as part of cfg
* Add simple overfitting test
2020-08-05 14:43:29 +02:00
Sofie Van Landeghem
b88c5c701a
Bugfix in nlp.replace_pipe (#5875)
* bugfix and unit test

* merge two conditions
2020-08-05 09:30:58 +02:00
Ines Montani
b795f02fbd
Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Sofie Van Landeghem
34873c4911
Example Dict format consistency (#5858)
* consistently use upper-case IDS in token_annotation format and for get_aligned

* remove ID from to_dict (not used in from_dict either)

* fix test

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 22:22:26 +02:00
Adriane Boyd
fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
Sofie Van Landeghem
492d1ec5de
Prevent alignment when texts don't match (#5867)
* remove empty gold.pyx

* add alignment unit test (to be used in docs)

* ensure that Alignment is only used on equal texts

* additional test using example.alignment

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 16:29:18 +02:00
Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
b7e3018d97
Recalculate alignment if tokenization differs (#5868)
* Recalculate alignment if tokenization differs

* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
svlandeg
6f4e46ee93 Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
# Conflicts:
#	pyproject.toml
#	requirements.txt
#	setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani
b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
b68c53858c Remove global 2020-07-31 18:37:58 +02:00
Ines Montani
30a76fcf6f Integrate and simplify pipe analysis 2020-07-31 18:34:35 +02:00
svlandeg
9b719dfb1a use divider inbetween steps 2020-07-31 18:06:48 +02:00
svlandeg
51ffc4a166 rename pipe_name to component 2020-07-31 17:58:55 +02:00
svlandeg
878327d38e printing final predictions by default to False 2020-07-31 17:36:32 +02:00
Ines Montani
2d955fbf98 Fix linting [ci skip] 2020-07-31 17:05:28 +02:00
Ines Montani
e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
svlandeg
cc2f58a1b0 use data_validation context manager 2020-07-31 16:49:42 +02:00
svlandeg
5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
svlandeg
08d3c36c20 bugfix in train CLI 2020-07-31 15:03:43 +02:00
Adriane Boyd
9b509aa87f Move Language.evaluate scorer config to new arg
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd
901801b33b Fix default arguments in DependencyParser.score 2020-07-31 10:55:44 +02:00
Adriane Boyd
9d79916792 Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
svlandeg
0b23594953 pipe_name instead of section in debug_model 2020-07-30 20:06:28 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani
b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Matthew Honnibal
a2d573c039 Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors 2020-07-29 14:56:27 +02:00
Matthew Honnibal
2af741d7e3 Fix train arg 2020-07-29 14:56:01 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors 2020-07-29 14:54:10 +02:00
Ines Montani
62266fb828 Fix broken type annotation 2020-07-29 14:49:49 +02:00
Matthew Honnibal
142b58be92 Fix import 2020-07-29 14:45:09 +02:00
Matthew Honnibal
c99a653070 Adjust textcat model 2020-07-29 14:38:15 +02:00
Matthew Honnibal
9e1b11dd81 Update vectors in textcat 2020-07-29 14:35:36 +02:00
Matthew Honnibal
105cf29967 Fix DocBin 2020-07-29 14:23:13 +02:00
Ines Montani
ff0bc05da8 Fix docstrings [ci skip] 2020-07-29 14:09:37 +02:00
Ines Montani
6e2623d3f8 Fix docstring [ci skip] 2020-07-29 14:08:05 +02:00
Ines Montani
8d56260d92 Fix docstrings [ci skip] 2020-07-29 14:07:13 +02:00
Ines Montani
80b18124d2 Fix docstring [ci skip] 2020-07-29 14:03:35 +02:00
Matthew Honnibal
f0cf4a2dca Update tests 2020-07-29 14:01:14 +02:00
Matthew Honnibal
07b47eaac8 Update tok2vec layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
5ae8628571 Fix CharacterEmbed layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
97d3651574 Fix stray link_vectors_to_models call 2020-07-29 14:01:13 +02:00
Matthew Honnibal
c7d1ece3eb Update tests 2020-07-29 14:01:13 +02:00
Matthew Honnibal
00de30bcc2 Update CharacterEmbed function 2020-07-29 14:01:12 +02:00
Matthew Honnibal
6a6b09bd32 Update morphologizer model 2020-07-29 14:01:12 +02:00
Matthew Honnibal
20e9098e3f Update tests 2020-07-29 14:01:12 +02:00
Matthew Honnibal
c35d6282fc Add previous HashEmbedCNN tok2vec to make transition easier 2020-07-29 14:01:12 +02:00
Matthew Honnibal
1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal
0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal
2aff3c4b5a Load vectors in 'spacy train' 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7852a68a75 Fix load_vectors_into_model function 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7299419fe4 Dont load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
30dd96c540 Load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
df95e2af64 Add load_vectors_into_model util 2020-07-29 14:00:12 +02:00
Matthew Honnibal
475d7c1c7c Fix StaticVectors class 2020-07-29 14:00:11 +02:00
Matthew Honnibal
44d350dc94 Use spaCy's StaticVectors 2020-07-29 14:00:11 +02:00
Matthew Honnibal
acc64e138a Add import 2020-07-29 14:00:11 +02:00
Matthew Honnibal
9987ea9e4d Fix Tok2Vec begin_training 2020-07-29 14:00:10 +02:00
Matthew Honnibal
099e9331c5 Fix tok2vec 2020-07-29 14:00:10 +02:00
Matthew Honnibal
fe0cdcd461 Fixes 2020-07-29 14:00:09 +02:00
Matthew Honnibal
123f8b832d Refactor Tok2Vec model 2020-07-29 14:00:09 +02:00
Matthew Honnibal
c6b4f63c7c Remove obsolete function 2020-07-29 14:00:09 +02:00
Matthew Honnibal
9cc7262224 Draft StaticVectors layer 2020-07-29 14:00:09 +02:00
Matthew Honnibal
cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani
e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem
40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
Adriane Boyd
191a12d75f
Fix score_weights typo in train CLI (#5835) 2020-07-29 11:04:12 +02:00
Adriane Boyd
0cddb0dbe9
Move timing into Language.evaluate (#5836)
Move timing into `Language.evaluate` so that only the processing is
timing, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
Adriane Boyd
c689ae8f0a Fix types in Scorer 2020-07-29 10:40:30 +02:00
Ines Montani
7adffc5361 Remove unused schema 2020-07-28 23:12:47 +02:00
Ines Montani
e5d9eaf79c Tidy up docstrings and arguments 2020-07-28 23:12:42 +02:00
Ines Montani
ac24adec73 Small adjustments to Scorer and docs 2020-07-28 21:39:42 +02:00
Ines Montani
2c7a32cf12 Remove unused methods 2020-07-28 16:50:02 +02:00
Ines Montani
ba22111ff4 Move error to Errors 2020-07-28 16:24:14 +02:00
Ines Montani
2748249217 Re-add meta["pipeline"] for now 2020-07-28 16:14:23 +02:00
Ines Montani
b83ead5bf5
Merge pull request #5824 from svlandeg/fix/textcat-v3 2020-07-28 15:04:25 +02:00
Ines Montani
06a97a8766 Support --opt=value format in CLI config overrides 2020-07-28 13:43:15 +02:00
Ines Montani
ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
Ines Montani
0094cb0d04 Remove scores list from config and document 2020-07-28 11:22:24 +02:00
Ines Montani
894e20c466 Merge branch 'develop' into feature/component-scores 2020-07-27 18:14:39 +02:00
Ines Montani
d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
svlandeg
85b2dcfd67 cleanup 2020-07-27 17:54:44 +02:00
svlandeg
61068e0fb1 util function dot_to_object and corresponding unit test 2020-07-27 17:50:12 +02:00
Ines Montani
10b84e1e27 Add flag to toggle sdist creation on package [ci skip] 2020-07-27 16:52:23 +02:00
Adriane Boyd
34c92dfe63 Add missing Scorer imports 2020-07-27 15:08:51 +02:00
Adriane Boyd
8bb0507777 Add and update score methods and score weights
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.

* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
2020-07-27 14:44:53 +02:00
Adriane Boyd
baf19fd652 Update cats scoring to provide overall score
* Provide top-level score as `attr_score`
* Provide a description of the score as `attr_score_desc`
* Provide all potential scores keys, setting unused keys to `None`
* Update CLI evaluate accordingly
2020-07-27 12:26:10 +02:00
Adriane Boyd
f8cf378be9 Combine weights from multiple components
Combine weights from multiple components for the same score.
2020-07-27 10:21:31 +02:00
Ines Montani
3d56a3f286 Make more args keyword-only 2020-07-27 00:27:53 +02:00
Matthew Honnibal
80271ac0ba Update default config 2020-07-26 15:27:39 +02:00
Ines Montani
ed61fb10fc Rename default textcat arch to TextCatEnsemble 2020-07-26 15:11:43 +02:00
Ines Montani
53d37da29a Make sure @factories is removed from config 2020-07-26 15:11:24 +02:00
Ines Montani
4060c2d5a6 Fix test 2020-07-26 13:40:19 +02:00
Ines Montani
2470486543 Allow pipeline components to set default scores and weights 2020-07-26 13:18:43 +02:00