Commit Graph

7479 Commits

Author SHA1 Message Date
Ines Montani
4ca08c6d5d
Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api
Add AttributeRuler API docs
2020-08-07 13:55:12 +02:00
Adriane Boyd
b8d0c23857 Add AttributeRuler API docs
With additional minor updates to AttributeRuler docstrings.
2020-08-07 12:43:23 +02:00
svlandeg
b17db0e994 Merge remote-tracking branch 'upstream/develop' into feature/el-docs
# Conflicts:
#	website/docs/usage/training.md
2020-08-06 19:48:52 +02:00
Adriane Boyd
06c3a5e048
Add pipe to AttributeRuler (#5889) 2020-08-06 19:43:09 +02:00
Ines Montani
9b7f198390 Fix format 2020-08-06 19:30:53 +02:00
Ines Montani
3c4389110d Remove unused imports 2020-08-06 19:30:47 +02:00
Matthew Honnibal
d4525816ef
Be less choosy about reporting textcat scores (#5879)
* Set textcat scores more consistently

* Refactor textcat scores

* Fixes to scorer

* Add comments

* Add threshold

* Rename just 'f' to micro_f in textcat scorer

* Fix textcat score for two-class

* Fix syntax

* Fix textcat score

* Fix docstring
2020-08-06 16:24:13 +02:00
svlandeg
0b4d1e1bc4 'debug data' instead of 'debug-data' 2020-08-06 15:47:31 +02:00
svlandeg
881e3f8fd0 add docbin explanation and example 2020-08-06 15:29:44 +02:00
Adriane Boyd
5e683a6e46
Fix return values for per feat score (#5885)
* Fix return values for per feat score

Convert `PRFScore` to dict as other per type scores.

* Update tests accordingly
2020-08-06 15:14:47 +02:00
Ines Montani
913d21f0a3
Merge pull request #5882 from explosion/feature/raise-from
Use "raise ... from" in custom errors for better tracebacks
2020-08-06 00:35:26 +02:00
Ines Montani
06e80d95cd
Sync develop with nightly docs state (#5883)
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-08-06 00:28:14 +02:00
Ines Montani
d92954ac1d
Merge pull request #5881 from explosion/feature/better-error-model-shortcuts 2020-08-06 00:13:35 +02:00
Ines Montani
56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
Ines Montani
5cc0d89fad
Simplify config overrides in CLI and deserialization (#5880) 2020-08-05 23:35:09 +02:00
Ines Montani
0881455a5d Update error message 2020-08-05 23:15:05 +02:00
Ines Montani
2a1fa86a0d Add better error for failed model shortcut loading 2020-08-05 23:10:29 +02:00
Ines Montani
c675746ca2 Update docstrings and types 2020-08-05 20:29:46 +02:00
Ines Montani
823e533dc1
Add config callbacks for modifying nlp object before and after init (#5866)
* WIP: Concept for modifying nlp object before and after init

* Make callbacks return nlp object

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Raise if callbacks don't return correct type

* Rename, update types, add after_pipeline_creation

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-05 19:47:54 +02:00
Ines Montani
586d695775 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-05 16:01:11 +02:00
Ines Montani
e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Matthew Honnibal
50c0e49741 Fix train CLI 2020-08-05 15:40:47 +02:00
Matthew Honnibal
b9df4d6116 Fix textcat.begin_training if vectors set 2020-08-05 15:40:36 +02:00
Adriane Boyd
af125875cf
Update SimpleNER (#5878)
* Fix `get_loss` to use NER annotation
* Add labels as part of cfg
* Add simple overfitting test
2020-08-05 14:43:29 +02:00
Sofie Van Landeghem
b88c5c701a
Bugfix in nlp.replace_pipe (#5875)
* bugfix and unit test

* merge two conditions
2020-08-05 09:30:58 +02:00
Ines Montani
b795f02fbd
Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Sofie Van Landeghem
34873c4911
Example Dict format consistency (#5858)
* consistently use upper-case IDS in token_annotation format and for get_aligned

* remove ID from to_dict (not used in from_dict either)

* fix test

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 22:22:26 +02:00
Adriane Boyd
fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
Sofie Van Landeghem
492d1ec5de
Prevent alignment when texts don't match (#5867)
* remove empty gold.pyx

* add alignment unit test (to be used in docs)

* ensure that Alignment is only used on equal texts

* additional test using example.alignment

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 16:29:18 +02:00
Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
b7e3018d97
Recalculate alignment if tokenization differs (#5868)
* Recalculate alignment if tokenization differs

* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
svlandeg
6f4e46ee93 Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
# Conflicts:
#	pyproject.toml
#	requirements.txt
#	setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani
b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
b68c53858c Remove global 2020-07-31 18:37:58 +02:00
Ines Montani
30a76fcf6f Integrate and simplify pipe analysis 2020-07-31 18:34:35 +02:00
svlandeg
9b719dfb1a use divider inbetween steps 2020-07-31 18:06:48 +02:00
svlandeg
51ffc4a166 rename pipe_name to component 2020-07-31 17:58:55 +02:00
svlandeg
878327d38e printing final predictions by default to False 2020-07-31 17:36:32 +02:00
Ines Montani
2d955fbf98 Fix linting [ci skip] 2020-07-31 17:05:28 +02:00
Ines Montani
e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
svlandeg
cc2f58a1b0 use data_validation context manager 2020-07-31 16:49:42 +02:00
svlandeg
5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
svlandeg
08d3c36c20 bugfix in train CLI 2020-07-31 15:03:43 +02:00
Adriane Boyd
9b509aa87f Move Language.evaluate scorer config to new arg
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd
901801b33b Fix default arguments in DependencyParser.score 2020-07-31 10:55:44 +02:00
Adriane Boyd
9d79916792 Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
svlandeg
0b23594953 pipe_name instead of section in debug_model 2020-07-30 20:06:28 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani
b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Matthew Honnibal
a2d573c039 Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors 2020-07-29 14:56:27 +02:00
Matthew Honnibal
2af741d7e3 Fix train arg 2020-07-29 14:56:01 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors 2020-07-29 14:54:10 +02:00
Ines Montani
62266fb828 Fix broken type annotation 2020-07-29 14:49:49 +02:00
Matthew Honnibal
142b58be92 Fix import 2020-07-29 14:45:09 +02:00
Matthew Honnibal
c99a653070 Adjust textcat model 2020-07-29 14:38:15 +02:00
Matthew Honnibal
9e1b11dd81 Update vectors in textcat 2020-07-29 14:35:36 +02:00
Matthew Honnibal
105cf29967 Fix DocBin 2020-07-29 14:23:13 +02:00
Ines Montani
ff0bc05da8 Fix docstrings [ci skip] 2020-07-29 14:09:37 +02:00
Ines Montani
6e2623d3f8 Fix docstring [ci skip] 2020-07-29 14:08:05 +02:00
Ines Montani
8d56260d92 Fix docstrings [ci skip] 2020-07-29 14:07:13 +02:00
Ines Montani
80b18124d2 Fix docstring [ci skip] 2020-07-29 14:03:35 +02:00
Matthew Honnibal
f0cf4a2dca Update tests 2020-07-29 14:01:14 +02:00
Matthew Honnibal
07b47eaac8 Update tok2vec layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
5ae8628571 Fix CharacterEmbed layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
97d3651574 Fix stray link_vectors_to_models call 2020-07-29 14:01:13 +02:00
Matthew Honnibal
c7d1ece3eb Update tests 2020-07-29 14:01:13 +02:00
Matthew Honnibal
00de30bcc2 Update CharacterEmbed function 2020-07-29 14:01:12 +02:00
Matthew Honnibal
6a6b09bd32 Update morphologizer model 2020-07-29 14:01:12 +02:00
Matthew Honnibal
20e9098e3f Update tests 2020-07-29 14:01:12 +02:00
Matthew Honnibal
c35d6282fc Add previous HashEmbedCNN tok2vec to make transition easier 2020-07-29 14:01:12 +02:00
Matthew Honnibal
1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal
0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal
2aff3c4b5a Load vectors in 'spacy train' 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7852a68a75 Fix load_vectors_into_model function 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7299419fe4 Dont load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
30dd96c540 Load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
df95e2af64 Add load_vectors_into_model util 2020-07-29 14:00:12 +02:00
Matthew Honnibal
475d7c1c7c Fix StaticVectors class 2020-07-29 14:00:11 +02:00
Matthew Honnibal
44d350dc94 Use spaCy's StaticVectors 2020-07-29 14:00:11 +02:00
Matthew Honnibal
acc64e138a Add import 2020-07-29 14:00:11 +02:00
Matthew Honnibal
9987ea9e4d Fix Tok2Vec begin_training 2020-07-29 14:00:10 +02:00
Matthew Honnibal
099e9331c5 Fix tok2vec 2020-07-29 14:00:10 +02:00
Matthew Honnibal
fe0cdcd461 Fixes 2020-07-29 14:00:09 +02:00
Matthew Honnibal
123f8b832d Refactor Tok2Vec model 2020-07-29 14:00:09 +02:00
Matthew Honnibal
c6b4f63c7c Remove obsolete function 2020-07-29 14:00:09 +02:00
Matthew Honnibal
9cc7262224 Draft StaticVectors layer 2020-07-29 14:00:09 +02:00
Matthew Honnibal
cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani
e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem
40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
Adriane Boyd
191a12d75f
Fix score_weights typo in train CLI (#5835) 2020-07-29 11:04:12 +02:00
Adriane Boyd
0cddb0dbe9
Move timing into Language.evaluate (#5836)
Move timing into `Language.evaluate` so that only the processing is
timing, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
Adriane Boyd
c689ae8f0a Fix types in Scorer 2020-07-29 10:40:30 +02:00
Ines Montani
7adffc5361 Remove unused schema 2020-07-28 23:12:47 +02:00
Ines Montani
e5d9eaf79c Tidy up docstrings and arguments 2020-07-28 23:12:42 +02:00
Ines Montani
ac24adec73 Small adjustments to Scorer and docs 2020-07-28 21:39:42 +02:00