Commit Graph

356 Commits

Author SHA1 Message Date
Sofie Van Landeghem
744df9814a
define threshold for scoring textcat in TextCat config (#6055)
* define threshold for scoring textcat in TextCat config

* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Sofie Van Landeghem
cb66ea7400
Remove simple_ner code (#6041)
* remove simple_ner code

* remove unused _biluo and _iob files
2020-09-09 16:11:27 +02:00
Sofie Van Landeghem
8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Sofie Van Landeghem
60f22e1800
Pipe API (#6034)
* ensure Language passes on valid examples for initialization

* fix tagger model initialization

* check for valid get_examples across components

* assume labels were added before begin_training

* fix senter initialization

* fix morphologizer initialization

* use methods to check arguments

* test textcat init, requires thinc>=8.0.0a31

* fix tok2vec init

* fix entity linker init

* use islice

* fix simple NER

* cleanup debug model

* fix assert statements

* fix tests

* throw error when adding a label if the output layer can't be resized anymore

* fix test

* add failing test for simple_ner

* UX improvements

* morphologizer UX

* assume begin_training gets a representative set and processes the labels

* remove assumptions for output of untrained NER model

* restore test for original purpose
2020-09-08 22:44:25 +02:00
Matthew Honnibal
dae22f3dfa Fix ignoring of punct labels 2020-09-05 14:11:59 +02:00
Ines Montani
864a697e63 Merge branch 'develop' into master-tmp 2020-09-04 13:15:36 +02:00
Ines Montani
ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
Ines Montani
5afe6447cd registry.assets -> registry.misc 2020-09-03 17:31:14 +02:00
Matthew Honnibal
737a1408d9 Improve implementation of fix #6010
Follow-ups to the parser efficiency fix.

* Avoid introducing new counter for number of pushes
* Base cut on number of transitions, keeping it more even
* Reintroduce the randomization we had in v2.
2020-09-02 14:42:32 +02:00
Matthew Honnibal
c1bf3a5602
Fix significant performance bug in parser training (#6010)
The parser training makes use of a trick for long documents, where we
use the oracle to cut up the document into sections, so that we can have
batch items in the middle of a document. For instance, if we have one
document of 600 words, we might make 6 states, starting at words 0, 100,
200, 300, 400 and 500.

The problem is for v3, I screwed this up and didn't stop parsing! So
instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch
of [600, 500, 400, 300, 200, 100]. Oops.

The implementation here could probably be improved, it's annoying to
have this extra variable in the state. But this'll do.

This makes the v3 parser training 5-10 times faster, depending on document
lengths. This problem wasn't in v2.
2020-09-02 12:57:13 +02:00
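
A small sketch of the arithmetic behind #6010 above. It is not the parser's code: `section_lengths` and its `stop_at_cut` flag are hypothetical names, and the cut is taken on plain word counts for simplicity. It only reproduces the two batches quoted in the commit message.

```python
def section_lengths(doc_len, cut, stop_at_cut):
    """Return how many words each parser state would process.

    Hypothetical helper: one state starts every `cut` words.
    stop_at_cut=True  -> each state stops where the next one begins (intended).
    stop_at_cut=False -> each state parses to the end of the document (the bug).
    """
    starts = range(0, doc_len, cut)
    if stop_at_cut:
        return [min(start + cut, doc_len) - start for start in starts]
    return [doc_len - start for start in starts]


intended = section_lengths(600, 100, stop_at_cut=True)
buggy = section_lengths(600, 100, stop_at_cut=False)
print(intended)  # [100, 100, 100, 100, 100, 100]
print(buggy)     # [600, 500, 400, 300, 200, 100]
```

The longer the document, the more the buggy batch grows relative to the intended one, which fits the note that the speed-up depends on document lengths.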
Matthew Honnibal
4cce32f090 Fix tagger initialization 2020-09-01 16:38:34 +02:00
Adriane Boyd
9130094199
Prevent Tagger model init with 0 labels (#5984)
* Prevent Tagger model init with 0 labels

Raise an error before trying to initialize a tagger model with 0 labels.

* Add dummy tagger label for test

* Remove tagless tagger model initialization

* Fix error number after merge

* Add dummy tagger label to test

* Fix formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-31 21:24:33 +02:00
Sofie Van Landeghem
ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Ines Montani
45f46a5c85
Merge pull request #5993 from explosion/feature/disabled-components 2020-08-29 15:58:41 +02:00
Ines Montani
34146750d4 Use frozen list with custom errors
We don't want to break backwards compatibility too much but we also want to provide the best possible UX
2020-08-29 15:20:11 +02:00
Ines Montani
2bc31e15c9 Tidy up and auto-format [ci skip] 2020-08-29 13:01:10 +02:00
Ines Montani
f45095a666
Merge pull request #5995 from adrianeboyd/bugfix/attribute-ruler-bugfixes 2020-08-29 12:38:30 +02:00
Matthew Honnibal
58f19421b1 Return empty batch from tok2vec listener if no doc.tensor 2020-08-29 03:46:50 +02:00
Adriane Boyd
0104bd1600 Sort the AttributeRuler matches by rule order
Sort the returned matches by rule order (the `match_id`) so that the
rules are applied in the order they were added. This is necessary, for
instance, if the `AttributeRuler` is used for the tag map and later
rules require POS tags.
2020-08-28 21:01:06 +02:00
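
Why rule order matters in 0104bd1600 above can be shown with plain tuples. This is a sketch, not the `AttributeRuler` code: it assumes each match is `(match_id, start, end)`, that `match_id` is the integer position of the rule in insertion order, and that attributes go on the first token of the match. `apply_in_rule_order` and `rule_attrs` are hypothetical names.

```python
def apply_in_rule_order(matches, rule_attrs):
    """Apply attrs to tokens, visiting matches in the order rules were added."""
    token_attrs = {}
    # Sorting on match_id restores rule order regardless of match order.
    for match_id, start, end in sorted(matches, key=lambda m: m[0]):
        token_attrs.setdefault(start, {}).update(rule_attrs[match_id])
    return token_attrs


# Rule 0 was added first, rule 1 second; the matcher returned them reversed.
rule_attrs = {0: {"TAG": "NN", "POS": "X"}, 1: {"POS": "NOUN"}}
matches = [(1, 0, 1), (0, 0, 1)]
print(apply_in_rule_order(matches, rule_attrs))
# {0: {'TAG': 'NN', 'POS': 'NOUN'}}
```

Without the sort, rule 0 would run last here and overwrite the POS set by rule 1.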
Adriane Boyd
8674b17651 Serialize AttributeRuler.patterns
Serialize `AttributeRuler.patterns` instead of the individual lists to
simplify the serialized data and so that patterns are reloaded exactly as
they were originally provided (preserving `_attrs_unnormed`).
2020-08-28 20:44:45 +02:00
Matthew Honnibal
d3ffe4ca63 Fix error when tagger was initialized with no labels 2020-08-27 18:56:58 +02:00
Matthew Honnibal
95adb58f15 Force tagger to pass batch of docs into model in begin_training 2020-08-27 03:21:03 +02:00
Adriane Boyd
90d88729e0
Add AttributeRuler.score (#5963)
* Add AttributeRuler.score

Add scoring for TAG / POS / MORPH / LEMMA if these are present in the
assigned token attributes.

Add default score weights (that don't really make a lot of sense) so
that the scores are in the default config in some form.

* Update docs
2020-08-26 15:39:30 +02:00
Sofie Van Landeghem
79d460e3a2
Weights & Biases logger for train CLI (#5971)
* quick test as part of train script

* train_logger in config, default ConsoleLogger in loggers catalogue

* entitiy typo

* add wandb_logger

* cleanup

* Update spacy/cli/train_logger.py

Co-authored-by: Ines Montani <ines@ines.io>

* move loggers to gold.loggers

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-26 15:24:33 +02:00
Sofie Van Landeghem
358cbb21e3
Define candidate generator in EL config (#5876)
* candidate generator as separate part of EL config

* update comment

* ent instead of str as input for candidate generation

* Span instead of str: correct type indication

* fix types

* unit test to create new candidate generator

* fix replace_pipe argument passing

* move error message, general cleanup

* add vocab back to KB constructor

* provide KB as callable from Vocab arg

* rename to kb_loader, fix KB serialization as part of the EL pipe

* fix typo

* reformatting

* cleanup

* fix comment

* fix wrongly duplicated code from merge conflict

* rename dump to to_disk

* from_disk instead of load_bulk

* update test after recent removal of set_morphology in tagger

* remove old doc
2020-08-18 16:10:36 +02:00
Ines Montani
8128e5eb35 Replace lexeme_norm warning with logging 2020-08-14 15:00:52 +02:00
Ines Montani
e4d0990857 Only receive from listener if listener exists 2020-08-14 14:58:48 +02:00
Adam Bittlingmayer
7b33b2854f
Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910)
* Add Armenian sentence-final verchaket

* Add Greek and Arabic question marks, and contributor agreement

* Check box
2020-08-12 15:36:14 +02:00
graue70
49e690bde1
Fix typos in comments (#5904)
* Fix typo in comment

* Fix typo

* Add spaCy Contributor Agreement
2020-08-12 15:35:25 +02:00
graue70
ba84371ab0
Use init parameter (#5909) 2020-08-11 23:41:58 +02:00
Ines Montani
950832f087
Tidy up pipes (#5906)
* Tidy up pipes

* Fix init, defaults and raise custom errors

* Update docs

* Update docs [ci skip]

* Apply suggestions from code review

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Tidy up error handling and validation, fix consistency

* Simplify get_examples check

* Remove unused import [ci skip]

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-11 23:29:31 +02:00
Ines Montani
f79e4c094d Remove generic type
Seems to cause error on Python 3.8 with Cython?
2020-08-10 17:24:30 +02:00
Ines Montani
64f2f84098 Update docstrings and docs [ci skip] 2020-08-10 13:45:22 +02:00
Ines Montani
a4b448eec4 Remove unused compiler flag 2020-08-10 13:13:18 +02:00
Ines Montani
3eaeb73342 Tidy up and auto-format 2020-08-09 22:36:23 +02:00
Ines Montani
d5c78c7a34 Update docs and fix consistency 2020-08-09 22:31:52 +02:00
Ines Montani
7c6854d8d4 Fix missing imports 2020-08-09 22:28:29 +02:00
Ines Montani
a15c5fb191 Update docstrings and docs 2020-08-09 16:10:48 +02:00
Matthew Honnibal
134d933d67 Add docstring for entity linker factory 2020-08-09 15:19:28 +02:00
Matthew Honnibal
992ee1c02f Update tagger docstring 2020-08-09 15:09:31 +02:00
Matthew Honnibal
ebf9a7acbf Add textcat docstring 2020-08-09 15:07:09 +02:00
Matthew Honnibal
bbd8acd4bf Add docstrings for parser and NER. Simplify some arguments 2020-08-09 14:46:13 +02:00
Matthew Honnibal
39a3d64c01 Add docstrings for Tok2Vec component 2020-08-09 00:48:03 +02:00
Ines Montani
fe29ceec9e Merge branch 'develop' into docs/model-docstrings 2020-08-07 18:42:01 +02:00
Ines Montani
3a193eb8f1 Fix imports, types and default configs 2020-08-07 18:40:54 +02:00
Ines Montani
6f3649923c
Merge pull request #5893 from explosion/feature/validate-arg 2020-08-07 15:47:20 +02:00
Adriane Boyd
e962784531
Add Lemmatizer and simplify related components (#5848)
* Add Lemmatizer and simplify related components

* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests

Differences:

* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map

* Fix test

* Initial fix for Lemmatizer config/serialization

* Adjust init test to be more generic

* Adjust init test to force empty Lookups

* Add simple cache to rule-based lemmatizer

* Convert language-specific lemmatizers

Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.

* Fix French and Polish lemmatizers

* Remove outdated UPOS conversions

* Update Russian lemmatizer init in tests

* Add minimal init/run tests for custom lemmatizers

* Add option to overwrite existing lemmas

* Update mode setting, lookup loading, and caching

* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods

* Implement strict when lang is not found in lookups

* Fix tables/lookups in make_lemmatizer

* Reallow provided lookups and allow for stricter checks

* Add lookups asset to all Lemmatizer pipe tests

* Rename lookups in lemmatizer init test

* Clean up merge

* Refactor lookup table loading

* Add helper `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.

Additional slight refactor of lookups:

* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`

* Move registry assets into test methods

* Refactor lookups tables config

Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.

* Add pipe and score to lemmatizer

* Simplify Tagger.score

* Add missing import

* Clean up imports and auto-format

* Remove unused kwarg

* Tidy up and auto-format

* Update docstrings for Lemmatizer

Update docstrings for Lemmatizer.

Additionally modify `is_base_form` API to take `Token` instead of
individual features.

* Update docstrings

* Remove tag map values from Tagger.add_label

* Update API docs

* Fix relative link in Lemmatizer API docs
2020-08-07 15:27:13 +02:00
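
A rough sketch of what a `lookup` mode with an overwrite option (both mentioned in #5848 above) might look like. This is not the `Lemmatizer` implementation: tokens are plain dicts, `lookup_lemmatize` is a hypothetical function, and falling back to the token text when the table has no entry is an assumption.

```python
def lookup_lemmatize(tokens, table, overwrite=False):
    """Set token["lemma"] from a lookup table.

    Existing lemmas are kept unless overwrite=True. Words missing from
    the table fall back to their own text (assumed behaviour).
    """
    for token in tokens:
        if token.get("lemma") and not overwrite:
            continue
        token["lemma"] = table.get(token["text"], token["text"])
    return tokens


table = {"mice": "mouse", "ran": "run"}
tokens = [
    {"text": "mice", "lemma": ""},
    {"text": "ran", "lemma": "RAN"},   # lemma already set by another component
    {"text": "xyz", "lemma": ""},
]
print([t["lemma"] for t in lookup_lemmatize(tokens, table)])
# ['mouse', 'RAN', 'xyz']
```

Calling it again with `overwrite=True` would replace the existing `"RAN"` with the table entry `"run"`.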
Ines Montani
fc9a4fe827 Update attribute ruler 2020-08-07 14:43:55 +02:00
Ines Montani
a8404c3517 validation -> validate 2020-08-07 14:43:47 +02:00
Adriane Boyd
b8d0c23857 Add AttributeRuler API docs
With additional minor updates to AttributeRuler docstrings.
2020-08-07 12:43:23 +02:00
Adriane Boyd
06c3a5e048
Add pipe to AttributeRuler (#5889) 2020-08-06 19:43:09 +02:00
Ines Montani
9b7f198390 Fix format 2020-08-06 19:30:53 +02:00
Ines Montani
56c17973aa Use "raise ... from" in custom errors for better tracebacks 2020-08-05 23:53:21 +02:00
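
For readers unfamiliar with the `raise ... from` pattern named in 56c17973aa above, a minimal illustration follows. `TableNotFoundError` and `get_table` are hypothetical names, not taken from the codebase.

```python
class TableNotFoundError(ValueError):
    """Hypothetical custom error, standing in for a library-specific one."""


def get_table(tables, name):
    try:
        return tables[name]
    except KeyError as err:
        # "from err" records the KeyError as __cause__, so the traceback shows
        # "The above exception was the direct cause of the following exception".
        raise TableNotFoundError(
            f"Can't find table '{name}'. Available: {sorted(tables)}"
        ) from err


try:
    get_table({"lemma_lookup": {}}, "missing")
except TableNotFoundError as e:
    caught = e
    print(type(caught.__cause__).__name__)  # KeyError
```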
Ines Montani
586d695775 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-08-05 16:01:11 +02:00
Ines Montani
e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Matthew Honnibal
b9df4d6116 Fix textcat.begin_training if vectors set 2020-08-05 15:40:36 +02:00
Adriane Boyd
af125875cf
Update SimpleNER (#5878)
* Fix `get_loss` to use NER annotation
* Add labels as part of cfg
* Add simple overfitting test
2020-08-05 14:43:29 +02:00
Sofie Van Landeghem
34873c4911
Example Dict format consistency (#5858)
* consistently use upper-case IDS in token_annotation format and for get_aligned

* remove ID from to_dict (not used in from_dict either)

* fix test

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 22:22:26 +02:00
Adriane Boyd
fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
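
The index handling described in #5842 above (attributes applied to the token at a given index in the matched span, with negative indices counting from the end) can be sketched without any library code. `target_position` is a hypothetical helper; the bounds check mirrors the idea of the Span index checks from #5861 but is not their implementation.

```python
def target_position(start, end, index):
    """Map an index inside the matched span [start, end) to a doc position."""
    length = end - start
    if not -length <= index < length:
        raise IndexError(f"index {index} is outside a span of length {length}")
    # Non-negative indices count from the span start, negative from its end.
    return start + index if index >= 0 else end + index


# A match covering doc tokens 3, 4 and 5:
print(target_position(3, 6, 0))   # 3
print(target_position(3, 6, -1))  # 5
```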
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
ac14ce7c30
Prefer earlier spans in EntityRuler (#5843)
Similar to #4414, update the sorting in EntityRuler to prefer the first
span in overlapping spans.
2020-07-31 16:09:32 +02:00
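
One way to read "prefer the first span in overlapping spans" from #5843 above is sketched here. The exact sort key is an assumption (longer spans first, then the earlier start), and `filter_overlaps` is a hypothetical function, not the `EntityRuler` code.

```python
def filter_overlaps(matches):
    """Keep non-overlapping (label, start, end) matches, end exclusive.

    Assumed preference: longer spans win; for equal length, the span
    that starts earlier wins.
    """
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    taken = set()
    kept = []
    for label, start, end in ordered:
        if taken.isdisjoint(range(start, end)):
            kept.append((label, start, end))
            taken.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])


# Two overlapping spans of equal length: the earlier one is kept.
print(filter_overlaps([("B", 1, 3), ("A", 0, 2)]))  # [('A', 0, 2)]
```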
Adriane Boyd
901801b33b Fix default arguments in DependencyParser.score 2020-07-31 10:55:44 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani
b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors 2020-07-29 14:54:10 +02:00
Ines Montani
ff0bc05da8 Fix docstrings [ci skip] 2020-07-29 14:09:37 +02:00
Ines Montani
6e2623d3f8 Fix docstring [ci skip] 2020-07-29 14:08:05 +02:00
Ines Montani
8d56260d92 Fix docstrings [ci skip] 2020-07-29 14:07:13 +02:00
Ines Montani
80b18124d2 Fix docstring [ci skip] 2020-07-29 14:03:35 +02:00
Matthew Honnibal
6a6b09bd32 Update morphologizer model 2020-07-29 14:01:12 +02:00
Matthew Honnibal
1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal
9987ea9e4d Fix Tok2Vec begin_training 2020-07-29 14:00:10 +02:00
Ines Montani
e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem
40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
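
The difference between the FIRST and LONGEST options in #5771 above can be illustrated on bare `(start, end)` pairs. This is a sketch under assumptions, not the Cython matcher code: `greedy_filter` is a hypothetical function, and the tie-break for FIRST (longer match on equal start) is a guess. The tie-break for LONGEST (earlier match on equal length) follows the commit message.

```python
def greedy_filter(matches, greedy):
    """Reduce overlapping (start, end) matches, end exclusive."""
    if greedy == "FIRST":
        key = lambda m: (m[0], -(m[1] - m[0]))   # earliest start, then longest
    elif greedy == "LONGEST":
        key = lambda m: (-(m[1] - m[0]), m[0])   # longest, then earliest start
    else:
        raise ValueError(f"greedy must be 'FIRST' or 'LONGEST', got {greedy!r}")
    taken = set()
    kept = []
    for start, end in sorted(matches, key=key):
        if taken.isdisjoint(range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


matches = [(0, 2), (1, 5), (0, 1)]
print(greedy_filter(matches, "FIRST"))    # [(0, 2)]
print(greedy_filter(matches, "LONGEST"))  # [(0, 1), (1, 5)]
```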
Ines Montani
2c7a32cf12 Remove unused methods 2020-07-28 16:50:02 +02:00
Ines Montani
ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
Ines Montani
894e20c466 Merge branch 'develop' into feature/component-scores 2020-07-27 18:14:39 +02:00
Ines Montani
d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
Adriane Boyd
34c92dfe63 Add missing Scorer imports 2020-07-27 15:08:51 +02:00
Adriane Boyd
8bb0507777 Add and update score methods and score weights
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.

* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
2020-07-27 14:44:53 +02:00
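
How weights of `0.0` can be "displayed but not counted", as described in 8bb0507777 above, is sketched below. The weighted sum is an assumption about how an overall score could be combined; `overall_score` is a hypothetical helper and the numbers are made up for illustration.

```python
def overall_score(scores, weights):
    """Combine component scores into one number using per-key weights."""
    return sum(scores.get(key, 0.0) * weight for key, weight in weights.items())


# Illustrative values only.
weights = {"tag_acc": 0.5, "dep_las": 0.5, "dep_uas": 0.0}
scores = {"tag_acc": 0.75, "dep_las": 0.5, "dep_uas": 0.875}

displayed = list(weights)           # every key in the weights is shown
print(displayed)                    # ['tag_acc', 'dep_las', 'dep_uas']
print(overall_score(scores, weights))  # 0.625
```

`dep_uas` appears in the displayed columns but contributes nothing to the total because its weight is zero.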
Ines Montani
ed61fb10fc Rename default textcat arch to TextCatEnsemble 2020-07-26 15:11:43 +02:00
Ines Montani
2470486543 Allow pipeline components to set default scores and weights 2020-07-26 13:18:43 +02:00
Ines Montani
787d066e22 Remove pipes.pyx
Probably accidentally re-added in a merge?
2020-07-26 13:08:52 +02:00
Ines Montani
e92df281ce Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
Ines Montani
cdbd6ba912
Merge pull request #5798 from explosion/feature/language-data-config 2020-07-25 13:34:49 +02:00
Adriane Boyd
2bcceb80c4
Refactor the Scorer to improve flexibility (#5731)
* Refactor the Scorer to improve flexibility

Refactor the `Scorer` to improve flexibility for arbitrary pipeline
components.

* Individual pipeline components provide their own `evaluate` methods
that score a list of `Example`s and return a dictionary of scores
* `Scorer` is initialized either:
  * with a provided pipeline containing components to be scored
  * with a default pipeline containing the built-in statistical
    components (senter, tagger, morphologizer, parser, ner)
* `Scorer.score` evaluates a list of `Example`s and returns a dictionary
of scores referring to the scores provided by the components in the
pipeline

Significant differences:

* `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc`
and the new `morph_acc`, `pos_acc`, and `lemma_acc`
* Scoring is no longer cumulative: `Scorer.score` scores a list of
examples rather than a single example and does not retain any state
about previously scored examples
* PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

* Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is
used to key the returned scores

Naming differences:

* `uas`, `las`, and `las_per_type` in the scores dict are renamed to
`dep_uas`, `dep_las`, and `dep_las_per_type`

Scoring differences:

* `Doc.sents` is now scored as spans rather than on sentence-initial
token positions so that `Doc.sents` and `Doc.ents` can be scored with
the same method (this lowers scores since a single incorrect sentence
start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

* Add hasattr check to tokenizer scoring
* Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

Reset the Example alignment if either doc is set in case the
tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

Add PRF scores for tokens as character spans. The scores are:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config
2020-07-25 12:53:02 +02:00
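
Two points from #5731 above, scoring tokens as character spans and scoring sentences as spans, can be shown with one span-based PRF function. This is a sketch, not the `Scorer` code: `span_prf` and `token_char_spans` are hypothetical names, tokens are `(idx, text)` pairs, and values stay in the 0 to 1 range, in line with the note that PRF values are no longer multiplied by 100.

```python
def span_prf(pred_spans, gold_spans):
    """Precision, recall and F-score over exact (start, end) span matches."""
    pred, gold = set(pred_spans), set(gold_spans)
    correct = len(pred & gold)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def token_char_spans(tokens):
    """(token.idx, token.idx + len(token)) for (idx, text) pairs."""
    return [(idx, idx + len(text)) for idx, text in tokens]


# Tokenization: gold splits "don't", the prediction does not.
gold_tokens = [(0, "do"), (2, "n't"), (6, "go")]
pred_tokens = [(0, "don't"), (6, "go")]
token_scores = span_prf(token_char_spans(pred_tokens), token_char_spans(gold_tokens))

# Sentences as token-offset spans: one misplaced sentence start
# makes both predicted spans wrong.
gold_sents = [(0, 5), (5, 10)]
pred_sents = [(0, 4), (4, 10)]
sent_scores = span_prf(pred_sents, gold_sents)
print(token_scores["p"], sent_scores["f"])  # 0.5 0.0
```

The sentence example shows why span-based scoring lowers sentence scores compared with scoring sentence-initial token positions.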
Ines Montani
b9aaa4e457 Improve vocab data integration and warning 2020-07-25 11:51:30 +02:00
Adriane Boyd
038ff1a811
Improve warnings around normalization tables (#5794)
Provide more customized normalization table warnings when training a new
model. Only suggest installing `spacy-lookups-data` if it's not already
installed and it includes a table for this language (currently checked
in a hard-coded list).
2020-07-22 16:04:58 +02:00
Ines Montani
43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
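
The reasoning in #5759 above for keeping component blocks separate from the pipeline order can be demonstrated with the standard library. `configparser` is used here only as a stand-in for the project's own config system, and the config text is a made-up example following the `[components]` and `[nlp.pipeline]` naming from the commit message.

```python
import configparser
import json

CONFIG = """
[nlp]
pipeline = ["tagger", "parser"]

[components.parser]
factory = "parser"

[components.ner]
factory = "ner"

[components.tagger]
factory = "tagger"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# The order comes from the list, not from where the blocks sit in the file.
order = json.loads(parser["nlp"]["pipeline"])
# Blocks may define more components than the pipeline actually uses.
defined = [s.split(".", 1)[1] for s in parser.sections() if s.startswith("components.")]
unused = [name for name in defined if name not in order]

print(order)   # ['tagger', 'parser']
print(unused)  # ['ner']
```

Reordering the `[components.*]` blocks in the file leaves `order` unchanged, which is the property the commit message asks for.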
Ines Montani
644074b954 Merge branch 'develop' into master-tmp 2020-07-20 14:58:04 +02:00
Ines Montani
796f6c52d1 Merge branch 'develop' into pr/5767 2020-07-19 13:37:46 +02:00
Adriane Boyd
b81a89f0a9
Update morphologizer (#5766)
* update `Morphologizer.begin_training` for use with `Example`

* make init and begin_training more consistent

* add `Morphology.normalize_features` to normalize outside of
`Morphology.add`

* make sure `get_loss` doesn't create unknown labels when the POS and
morph alignments differ
2020-07-19 11:10:51 +02:00
Adriane Boyd
50db3f0cdb Serialize morph rules with tagger
Serialize `morph_rules` with the tagger alongside the `tag_map`.

Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to
load these settings rather than reinitializing the morphology each time
they are changed.
2020-07-17 08:22:21 +02:00
Ines Montani
5f6f4ff594 Remove object subclassing 2020-07-12 14:03:23 +02:00
Sofie Van Landeghem
dd207a28be
cleanup components API (#5726)
* add keyword separator for update functions and drop unused "state"

* few more Example tests and various small fixes

* consistently return losses after update call

* eliminate unused tensors field across pipe components

* fix name

* fix arg name
2020-07-09 19:43:39 +02:00
Adriane Boyd
ad15499b3b
Fix get_loss for values outside of labels in senter (#5730)
* Fix get_loss for None alignments in senter

When converting the `sent_start` values back to `SentenceRecognizer`
labels, handle `None` alignments.

* Handle SENT_START as -1

Handle SENT_START as -1 (or -1 converted to uint64) by treating any
values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.
2020-07-09 01:41:58 +02:00
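
The rule from #5730 above, "treat any values other than 1 the same as 0", is simple enough to show directly. `sent_start_label` is a hypothetical helper, not the `SentenceRecognizer.get_loss` code; the sample values cover the cases named in the commit message.

```python
def sent_start_label(value):
    """Collapse a SENT_START value to a binary label: 1 stays 1, all else is 0."""
    return 1 if value == 1 else 0


# 1 = sentence start, 0 = unknown, -1 = not a start,
# None = no alignment, 2**64 - 1 = -1 stored as uint64.
values = [1, 0, -1, None, 2**64 - 1]
labels = [sent_start_label(v) for v in values]
print(labels)  # [1, 0, 0, 0, 0]
```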
Adriane Boyd
c9f0f75778
Update get_loss for senter and morphologizer (#5724)
* Update get_loss for senter

Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`.

* Update get_loss for morphologizer

Update `Morphologizer.get_loss` to keep it similar to `Tagger`.
2020-07-08 13:59:28 +02:00
Matthw Honnibal
a4164f67ca Don't normalize gradients 2020-07-07 17:21:58 +02:00