7506 Commits

Author SHA1 Message Date
Adriane Boyd
af125875cf
Update SimpleNER (#5878)
* Fix `get_loss` to use NER annotation
* Add labels as part of cfg
* Add simple overfitting test
2020-08-05 14:43:29 +02:00
Sofie Van Landeghem
b88c5c701a
Bugfix in nlp.replace_pipe (#5875)
* bugfix and unit test

* merge two conditions
2020-08-05 09:30:58 +02:00
Ines Montani
b795f02fbd
Allow adding pipeline components from source model (#5857)
* Allow adding pipeline components from source model

* Config: name -> component

* Improve error messages

* Fix error and test

* Add frozen components and exclude logic

* Remove exclude from Language.evaluate

* Init sourced components with current vocab

* Fix error codes
2020-08-04 23:39:19 +02:00
Sofie Van Landeghem
34873c4911
Example Dict format consistency (#5858)
* consistently use upper-case IDS in token_annotation format and for get_aligned

* remove ID from to_dict (not used in from_dict either)

* fix test

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 22:22:26 +02:00
Adriane Boyd
fa79a0db9f
Add AttributeRuler for token attribute exceptions (#5842)
* Add AttributeRuler for token attribute exceptions

Add the `AttributeRuler` to handle exceptions for token-level
attributes. The `AttributeRuler` uses `Matcher` patterns to identify
target spans and applies the specified attributes to the token at the
provided index in the matched span. A negative index can be used to
index from the end of the matched span. The retokenizer is used to
"merge" the individual tokens and assign them the provided attributes.

Helper functions can import existing tag maps and morph rules to the
corresponding `Matcher` patterns.

There is an additional minor bug fix for `MORPH` attributes in the
retokenizer to correctly normalize the values and to handle `MORPH`
alongside `_` in an attrs dict.

* Fix default name

* Update name in error message

* Extend AttributeRuler functionality

* Add option to initialize with a dict of AttributeRuler patterns

* Instead of silently discarding overlapping matches (the default
behavior for the retokenizer if only the attrs differ), split the
matches into disjoint sets and retokenize each set separately. This
allows, for instance, one pattern to set the POS and another pattern to
set the lemma. (If two matches modify the same attribute, it looks like
the attrs are applied in the order they were added, but it may not be
deterministic?)

* Improve types

* Sort spans before processing

* Fix index boundaries in Span

* Refactor retokenizer to separate attrs methods

Add top-level `normalize_token_attrs` and `set_token_attrs` methods.

* Update AttributeRuler to use refactored methods

Update `AttributeRuler` to replace use of full retokenizer with only the
relevant methods for normalizing and setting attributes for a single
token.

* Update spacy/pipeline/attributeruler.py

Co-authored-by: Ines Montani <ines@ines.io>

* Make API more similar to EntityRuler

* Add `AttributeRuler.add_patterns` to add patterns from a list of dicts
* Return list of dicts as property `AttributeRuler.patterns`

* Make attrs_unnormed private

* Add test loading patterns from assets

* Revert "Fix index boundaries in Span"

This reverts commit 8f8a5c3386.

* Add Span index boundary checks (#5861)

* Add Span index boundary checks

* Return Span-specific IndexError in all cases

* Simplify and fix if/else

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 17:02:39 +02:00
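The AttributeRuler commit above (fa79a0db9f) describes two mechanics that are easy to sketch in isolation: resolving a possibly negative index within a matched span, and splitting overlapping matches into disjoint sets so that each set can be processed separately. The following is a minimal pure-Python sketch, not the code from the commit; the names `resolve_index` and `split_into_disjoint_sets` are hypothetical, and spans are assumed to be `(start, end)` token offsets with an exclusive end.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumption)


def resolve_index(span: Span, index: int) -> int:
    """Map an index relative to the matched span to an absolute token position.

    A negative index counts from the end of the span, as in Python lists.
    """
    start, end = span
    length = end - start
    if not -length <= index < length:
        raise IndexError(f"index {index} is out of range for span {span}")
    return start + index if index >= 0 else end + index


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def split_into_disjoint_sets(spans: List[Span]) -> List[List[Span]]:
    """Place each span in the first group where it overlaps nothing."""
    groups: List[List[Span]] = []
    for span in sorted(spans):
        for group in groups:
            if not any(overlaps(span, other) for other in group):
                group.append(span)
                break
        else:
            # The span overlaps something in every existing group.
            groups.append([span])
    return groups
```

With this grouping, two patterns that match overlapping tokens end up in different sets, so one can set the POS and another the lemma without either match being discarded.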
Sofie Van Landeghem
492d1ec5de
Prevent alignment when texts don't match (#5867)
* remove empty gold.pyx

* add alignment unit test (to be used in docs)

* ensure that Alignment is only used on equal texts

* additional test using example.alignment

* formatting

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-08-04 16:29:18 +02:00
Matthew Honnibal
ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Sofie Van Landeghem
82347110f5
Default empty KB in EL component (#5872)
* EL field documentation

* documentation consistent with docs

* default empty KB, initialize vocab separately

* formatting

* add test for changing the default entity vector length

* update comment
2020-08-04 14:34:09 +02:00
Adriane Boyd
b7e3018d97
Recalculate alignment if tokenization differs (#5868)
* Recalculate alignment if tokenization differs

* Refactor cached alignment data
2020-08-04 14:31:32 +02:00
Ines Montani
934447a611
Merge pull request #5855 from svlandeg/fix/cli-debug 2020-08-03 13:09:20 +02:00
Ines Montani
4c055f0aa7
Add init CLI and init config (#5854)
* Add init CLI and init config draft

* Improve config validation

* Auto-format

* Don't export anything in debug config

* Update docs
2020-08-02 15:18:30 +02:00
svlandeg
6f4e46ee93 Merge remote-tracking branch 'upstream/develop' into fix/cli-debug
# Conflicts:
#	pyproject.toml
#	requirements.txt
#	setup.cfg
2020-08-01 18:38:59 +02:00
Ines Montani
b40f44419b Simplify pipe analysis
- remove unused code
- don't print by default
- integrate attrs info into analysis output
2020-08-01 13:40:06 +02:00
Ines Montani
b68c53858c Remove global 2020-07-31 18:37:58 +02:00
Ines Montani
30a76fcf6f Integrate and simplify pipe analysis 2020-07-31 18:34:35 +02:00
svlandeg
9b719dfb1a use divider in between steps 2020-07-31 18:06:48 +02:00
svlandeg
51ffc4a166 rename pipe_name to component 2020-07-31 17:58:55 +02:00
svlandeg
878327d38e printing final predictions by default to False 2020-07-31 17:36:32 +02:00
Ines Montani
2d955fbf98 Fix linting [ci skip] 2020-07-31 17:05:28 +02:00
Ines Montani
e9e8fa2466 Update docs and types 2020-07-31 17:02:54 +02:00
svlandeg
cc2f58a1b0 use data_validation context manager 2020-07-31 16:49:42 +02:00
svlandeg
5fa3235d06 set DATA_VALIDATION to False for debug_model (upgrade thinc) 2020-07-31 15:21:01 +02:00
svlandeg
08d3c36c20 bugfix in train CLI 2020-07-31 15:03:43 +02:00
Adriane Boyd
9b509aa87f Move Language.evaluate scorer config to new arg
Move `Language.evaluate` scorer config from `component_cfg` to separate
argument `scorer_cfg`.
2020-07-31 11:05:16 +02:00
Adriane Boyd
901801b33b Fix default arguments in DependencyParser.score 2020-07-31 10:55:44 +02:00
Adriane Boyd
9d79916792 Merge branch 'develop' into feature/scorer-adjustments 2020-07-31 10:48:14 +02:00
Sofie Van Landeghem
ca491722ad
The Parser is now a Pipe (2) (#5844)
* moving syntax folder to _parser_internals

* moving nn_parser and transition_system

* move nn_parser and transition_system out of internals folder

* moving nn_parser code into transition_system file

* rename transition_system to transition_parser

* moving parser_model and _state to ml

* move _state back to internals

* The Parser now inherits from Pipe!

* small code fixes

* removing unnecessary imports

* remove link_vectors_to_models

* transition_system to internals folder

* little bit more cleanup

* newlines
2020-07-30 23:30:54 +02:00
svlandeg
0b23594953 pipe_name instead of section in debug_model 2020-07-30 20:06:28 +02:00
Ines Montani
7a21775cd0
Merge pull request #5834 from explosion/feature/vectors 2020-07-29 18:49:26 +02:00
Ines Montani
b0f57a0cac Update docs and consistency 2020-07-29 15:14:07 +02:00
Matthew Honnibal
a2d573c039 Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors 2020-07-29 14:56:27 +02:00
Matthew Honnibal
2af741d7e3 Fix train arg 2020-07-29 14:56:01 +02:00
Matthew Honnibal
c27309f839
Merge branch 'develop' into feature/vectors 2020-07-29 14:54:10 +02:00
Ines Montani
62266fb828 Fix broken type annotation 2020-07-29 14:49:49 +02:00
Matthew Honnibal
142b58be92 Fix import 2020-07-29 14:45:09 +02:00
Matthew Honnibal
c99a653070 Adjust textcat model 2020-07-29 14:38:15 +02:00
Matthew Honnibal
9e1b11dd81 Update vectors in textcat 2020-07-29 14:35:36 +02:00
Matthew Honnibal
105cf29967 Fix DocBin 2020-07-29 14:23:13 +02:00
Ines Montani
ff0bc05da8 Fix docstrings [ci skip] 2020-07-29 14:09:37 +02:00
Ines Montani
6e2623d3f8 Fix docstring [ci skip] 2020-07-29 14:08:05 +02:00
Ines Montani
8d56260d92 Fix docstrings [ci skip] 2020-07-29 14:07:13 +02:00
Ines Montani
80b18124d2 Fix docstring [ci skip] 2020-07-29 14:03:35 +02:00
Matthew Honnibal
f0cf4a2dca Update tests 2020-07-29 14:01:14 +02:00
Matthew Honnibal
07b47eaac8 Update tok2vec layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
5ae8628571 Fix CharacterEmbed layer 2020-07-29 14:01:13 +02:00
Matthew Honnibal
97d3651574 Fix stray link_vectors_to_models call 2020-07-29 14:01:13 +02:00
Matthew Honnibal
c7d1ece3eb Update tests 2020-07-29 14:01:13 +02:00
Matthew Honnibal
00de30bcc2 Update CharacterEmbed function 2020-07-29 14:01:12 +02:00
Matthew Honnibal
6a6b09bd32 Update morphologizer model 2020-07-29 14:01:12 +02:00
Matthew Honnibal
20e9098e3f Update tests 2020-07-29 14:01:12 +02:00
Matthew Honnibal
c35d6282fc Add previous HashEmbedCNN tok2vec to make transition easier 2020-07-29 14:01:12 +02:00
Matthew Honnibal
1784c95827 Clean up link_vectors_to_models unused stuff 2020-07-29 14:01:11 +02:00
Matthew Honnibal
0c17ea4c85 Format 2020-07-29 14:00:13 +02:00
Matthew Honnibal
2aff3c4b5a Load vectors in 'spacy train' 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7852a68a75 Fix load_vectors_into_model function 2020-07-29 14:00:13 +02:00
Matthew Honnibal
7299419fe4 Don't load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
30dd96c540 Load vectors in Language.from_config 2020-07-29 14:00:12 +02:00
Matthew Honnibal
df95e2af64 Add load_vectors_into_model util 2020-07-29 14:00:12 +02:00
Matthew Honnibal
475d7c1c7c Fix StaticVectors class 2020-07-29 14:00:11 +02:00
Matthew Honnibal
44d350dc94 Use spaCy's StaticVectors 2020-07-29 14:00:11 +02:00
Matthew Honnibal
acc64e138a Add import 2020-07-29 14:00:11 +02:00
Matthew Honnibal
9987ea9e4d Fix Tok2Vec begin_training 2020-07-29 14:00:10 +02:00
Matthew Honnibal
099e9331c5 Fix tok2vec 2020-07-29 14:00:10 +02:00
Matthew Honnibal
fe0cdcd461 Fixes 2020-07-29 14:00:09 +02:00
Matthew Honnibal
123f8b832d Refactor Tok2Vec model 2020-07-29 14:00:09 +02:00
Matthew Honnibal
c6b4f63c7c Remove obsolete function 2020-07-29 14:00:09 +02:00
Matthew Honnibal
9cc7262224 Draft StaticVectors layer 2020-07-29 14:00:09 +02:00
Matthew Honnibal
cb9654e98c WIP on new StaticVectors 2020-07-29 14:00:09 +02:00
Ines Montani
e257e66ab9 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-07-29 11:36:45 +02:00
Ines Montani
e0ffe36e79 Update docstrings, docs and types 2020-07-29 11:36:42 +02:00
Sofie Van Landeghem
40c995b1be
Option for returning only greedy matches (#5771)
* add "greedy" option for match pattern

* distinction between greedy FIRST or LONGEST

* check for proper values, throw custom warning otherwise

* unxfail one more test

* add comment in docstring

* add test that LONGEST also prefers first match if equal length

* use c arrays for more efficient processing

* rename 'greediness' to 'greedy'
2020-07-29 11:04:43 +02:00
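The difference between the `FIRST` and `LONGEST` settings in the commit above (40c995b1be) can be illustrated with a small filter over overlapping matches. This is only a sketch under stated assumptions: the function name `filter_greedy` is hypothetical, matches are plain `(start, end)` tuples with an exclusive end, and ties under `FIRST` are broken in favor of the longer match, which the commit message does not specify. The commit itself mentions C arrays for efficiency; this version favors readability.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumption)


def filter_greedy(matches: List[Match], greedy: str) -> List[Match]:
    """Keep a non-overlapping subset of matches, preferring by `greedy`."""
    if greedy == "FIRST":
        # Earliest start wins; longer match breaks ties (assumption).
        ordered = sorted(matches, key=lambda m: (m[0], -(m[1] - m[0])))
    elif greedy == "LONGEST":
        # Longest match wins; earlier start breaks ties.
        ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    else:
        raise ValueError(f"greedy must be 'FIRST' or 'LONGEST', got {greedy!r}")
    taken = set()  # token positions already covered by a kept match
    kept: List[Match] = []
    for start, end in ordered:
        if any(i in taken for i in range(start, end)):
            continue
        taken.update(range(start, end))
        kept.append((start, end))
    return sorted(kept)
```

For equal-length candidates, `LONGEST` falls back to the earlier match, which is the behavior the commit's added test describes.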
Adriane Boyd
191a12d75f
Fix score_weights typo in train CLI (#5835) 2020-07-29 11:04:12 +02:00
Adriane Boyd
0cddb0dbe9
Move timing into Language.evaluate (#5836)
Move timing into `Language.evaluate` so that only the processing is
timed, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
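The point of the commit above (0cddb0dbe9) is that the timer should wrap only the processing step, so that scoring does not distort the words-per-second figure. A minimal sketch of that idea follows; `evaluate_speed` and its `process` callback are hypothetical stand-ins and not the `Language.evaluate` implementation.

```python
import time
from typing import Callable, Dict, List


def evaluate_speed(
    texts: List[str], process: Callable[[str], List[str]]
) -> Dict[str, float]:
    """Time only the processing step and report words per second."""
    start = time.perf_counter()
    docs = [process(text) for text in texts]
    elapsed = time.perf_counter() - start  # timer stops before any scoring

    # Anything below this line (counting, scoring) is not part of the timing.
    n_words = sum(len(doc) for doc in docs)
    speed = n_words / elapsed if elapsed > 0 else 0.0
    return {"n_words": float(n_words), "speed": speed}
```

Here `process` could be any tokenizing function, for example `str.split` when trying the sketch out.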
Adriane Boyd
c689ae8f0a Fix types in Scorer 2020-07-29 10:40:30 +02:00
Ines Montani
7adffc5361 Remove unused schema 2020-07-28 23:12:47 +02:00
Ines Montani
e5d9eaf79c Tidy up docstrings and arguments 2020-07-28 23:12:42 +02:00
Ines Montani
ac24adec73 Small adjustments to Scorer and docs 2020-07-28 21:39:42 +02:00
Ines Montani
2c7a32cf12 Remove unused methods 2020-07-28 16:50:02 +02:00
Ines Montani
ba22111ff4 Move error to Errors 2020-07-28 16:24:14 +02:00
Ines Montani
2748249217 Re-add meta["pipeline"] for now 2020-07-28 16:14:23 +02:00
Ines Montani
b83ead5bf5
Merge pull request #5824 from svlandeg/fix/textcat-v3 2020-07-28 15:04:25 +02:00
Ines Montani
06a97a8766 Support --opt=value format in CLI config overrides 2020-07-28 13:43:15 +02:00
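Supporting both `--opt value` and `--opt=value` comes down to splitting on the first `=` when one is present. The sketch below shows one way to do that; `parse_overrides` is a hypothetical helper, the option names in the usage note are made up for illustration, and values are kept as raw strings rather than being parsed further.

```python
from typing import Dict, List


def parse_overrides(args: List[str]) -> Dict[str, str]:
    """Parse config overrides given as `--opt value` or `--opt=value`."""
    result: Dict[str, str] = {}
    i = 0
    while i < len(args):
        arg = args[i]
        if not arg.startswith("--"):
            raise ValueError(f"expected an option starting with '--', got {arg!r}")
        opt = arg[2:]
        if "=" in opt:
            # `--opt=value`: split on the first '=' only, so values may contain '='.
            name, value = opt.split("=", 1)
            i += 1
        else:
            # `--opt value`: the value is the next argument.
            if i + 1 >= len(args):
                raise ValueError(f"option {arg!r} is missing a value")
            name, value = opt, args[i + 1]
            i += 2
        result[name] = value
    return result
```

For example, `parse_overrides(["--training.batch_size=128", "--nlp.lang", "en"])` yields a dict with both dotted keys. Flags without a value are not handled in this sketch.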
Ines Montani
ae4d8a6ffd Update docstrings, docs and pipe consistency 2020-07-28 13:37:31 +02:00
Ines Montani
0094cb0d04 Remove scores list from config and document 2020-07-28 11:22:24 +02:00
Ines Montani
894e20c466 Merge branch 'develop' into feature/component-scores 2020-07-27 18:14:39 +02:00
Ines Montani
d8b519c23c API docs, docstrings and argument consistency 2020-07-27 18:11:45 +02:00
svlandeg
85b2dcfd67 cleanup 2020-07-27 17:54:44 +02:00
svlandeg
61068e0fb1 util function dot_to_object and corresponding unit test 2020-07-27 17:50:12 +02:00
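The commit above names a `dot_to_object` utility but does not show its body. Assuming it resolves a dotted path such as `training.optimizer` against a nested config dict, a sketch could look like the following; the signature and error handling are assumptions, not the code from the commit.

```python
from typing import Any, Dict


def dot_to_object(config: Dict[str, Any], section: str) -> Any:
    """Follow a dotted path through nested dicts and return what it points to."""
    component: Any = config
    for part in section.split("."):
        try:
            component = component[part]
        except (KeyError, TypeError):
            # TypeError covers the case where a non-dict value is reached early.
            raise KeyError(f"section {section!r} not found in config") from None
    return component
```

A lookup like this lets a CLI flag refer to any block of the config by name instead of hard-coding a fixed set of sections.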
Ines Montani
10b84e1e27 Add flag to toggle sdist creation on package [ci skip] 2020-07-27 16:52:23 +02:00
Adriane Boyd
34c92dfe63 Add missing Scorer imports 2020-07-27 15:08:51 +02:00
Adriane Boyd
8bb0507777 Add and update score methods and score weights
Add and update `score` methods, provided `scores`, and default weights
`default_score_weights` for pipeline components.

* `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`).
* `default_score_weights` provides the default weights for a default config.
* The keys from `default_score_weights` determine which values will be
shown in the `spacy train` output, so keys with weight `0.0` will be
displayed but not counted toward the overall score.
2020-07-27 14:44:53 +02:00
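The rule in the commit above (8bb0507777), that keys with weight `0.0` are displayed but do not count toward the overall score, can be sketched as a weighted sum over the keys of the weights dict. The function names and the sample score keys are illustrative assumptions; skipping `None` scores is also an assumption made here for robustness.

```python
from typing import Dict, List, Optional


def displayed_keys(weights: Dict[str, float]) -> List[str]:
    """Every key in the weights is shown, including those weighted 0.0."""
    return list(weights)


def overall_score(
    scores: Dict[str, Optional[float]], weights: Dict[str, float]
) -> float:
    """Weighted sum of scores; a 0.0 weight contributes nothing."""
    total = 0.0
    for key, weight in weights.items():
        value = scores.get(key)
        if value is None:
            # Unused or missing scores are skipped (assumption).
            continue
        total += value * weight
    return total
```

With weights `{"tag_acc": 0.5, "dep_uas": 0.0, "dep_las": 0.5}`, `dep_uas` still appears in the displayed columns but leaves the overall score unchanged.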
Adriane Boyd
baf19fd652 Update cats scoring to provide overall score
* Provide top-level score as `attr_score`
* Provide a description of the score as `attr_score_desc`
* Provide all potential scores keys, setting unused keys to `None`
* Update CLI evaluate accordingly
2020-07-27 12:26:10 +02:00
Adriane Boyd
f8cf378be9 Combine weights from multiple components
Combine weights from multiple components for the same score.
2020-07-27 10:21:31 +02:00
Ines Montani
3d56a3f286 Make more args keyword-only 2020-07-27 00:27:53 +02:00
Matthew Honnibal
80271ac0ba Update default config 2020-07-26 15:27:39 +02:00
Ines Montani
ed61fb10fc Rename default textcat arch to TextCatEnsemble 2020-07-26 15:11:43 +02:00
Ines Montani
53d37da29a Make sure @factories is removed from config 2020-07-26 15:11:24 +02:00
Ines Montani
4060c2d5a6 Fix test 2020-07-26 13:40:19 +02:00
Ines Montani
2470486543 Allow pipeline components to set default scores and weights 2020-07-26 13:18:43 +02:00
Ines Montani
787d066e22 Remove pipes.pyx
Probably accidentally re-added in a merge?
2020-07-26 13:08:52 +02:00
Matthew Honnibal
520d25cb50
Add smart_open dependency to fetch project assets (#5812)
* Use smart_open for project assets

* Fix assets.py

* Update pyproject.toml
2020-07-26 12:15:00 +02:00
Ines Montani
e92df281ce Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
Matthew Honnibal
71242327b2 Set version to v3.0.0a5 2020-07-25 14:06:01 +02:00
Ines Montani
cdbd6ba912
Merge pull request #5798 from explosion/feature/language-data-config 2020-07-25 13:34:49 +02:00
Ines Montani
49f27a2a7b Tidy up [ci skip] 2020-07-25 13:00:49 +02:00
Ines Montani
4a0a692875 Add missing lex_attr_getters (resolves #5806) 2020-07-25 12:55:18 +02:00
Adriane Boyd
2bcceb80c4
Refactor the Scorer to improve flexibility (#5731)
* Refactor the Scorer to improve flexibility

Refactor the `Scorer` to improve flexibility for arbitrary pipeline
components.

* Individual pipeline components provide their own `evaluate` methods
that score a list of `Example`s and return a dictionary of scores
* `Scorer` is initialized either:
  * with a provided pipeline containing components to be scored
  * with a default pipeline containing the built-in statistical
    components (senter, tagger, morphologizer, parser, ner)
* `Scorer.score` evaluates a list of `Example`s and returns a dictionary
of scores referring to the scores provided by the components in the
pipeline

Significant differences:

* `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc`
and the new `morph_acc`, `pos_acc`, and `lemma_acc`
* Scoring is no longer cumulative: `Scorer.score` scores a list of
examples rather than a single example and does not retain any state
about previously scored examples
* PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

* Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is
used to key the returned scores

Naming differences:

* `uas`, `las`, and `las_per_type` in the scores dict are renamed to
`dep_uas`, `dep_las`, and `dep_las_per_type`

Scoring differences:

* `Doc.sents` is now scored as spans rather than on sentence-initial
token positions so that `Doc.sents` and `Doc.ents` can be scored with
the same method (this lowers scores since a single incorrect sentence
start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

* Add hasattr check to tokenizer scoring
* Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

Reset the Example alignment if either doc is set in case the
tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

Add PRF scores for tokens as character spans. The scores are:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config
2020-07-25 12:53:02 +02:00
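The tokenization scores defined in the Scorer commit above (2bcceb80c4) compare tokens as character spans `(token.idx, token.idx + len(token))`. The sketch below computes them over plain `(idx, text)` pairs; `score_tokens` is a hypothetical name, and this is not the `Scorer.score_tokenization` implementation. Note that with the stated definition of `token_acc` (correct tokens over gold tokens), accuracy coincides with recall in this simplified version.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[int, str]  # (character offset, token text) (assumption)


def to_spans(tokens: List[Token]) -> Set[Tuple[int, int]]:
    """Convert tokens to (start_char, end_char) spans."""
    return {(idx, idx + len(text)) for idx, text in tokens}


def score_tokens(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Score predicted tokens against gold tokens as character spans."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {"token_acc": r, "token_p": p, "token_r": r, "token_f": f}
```

The values are fractions between 0 and 1, in line with the commit's note that PRF values are no longer multiplied by 100.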
Ines Montani
c003d26b94 Tidy up 2020-07-25 12:21:37 +02:00
Ines Montani
a063a82c40 Tidy up __init__.py 2020-07-25 12:14:37 +02:00
Ines Montani
8d9d28eb8b Re-add setting for vocab data and tidy up 2020-07-25 12:14:28 +02:00
Ines Montani
b9aaa4e457 Improve vocab data integration and warning 2020-07-25 11:51:30 +02:00
Ines Montani
38f6ea7a78 Simplify language data and revert detailed configs 2020-07-24 14:50:26 +02:00
Adriane Boyd
656574a01a
Update Japanese tests (#5807)
* Update POS tests to reflect current behavior (it is not entirely clear
whether the AUX/VERB mapping is indeed the desired behavior?)
* Switch to `from_config` initialization in subtoken test
2020-07-24 12:45:14 +02:00
Adriane Boyd
fdb8815ef5
Minor refactor for Morphology and MorphAnalysis (#5804)
* `MorphAnalysis.get` returns only the field values
* Move `_normalize_props` inside `Morphology` as
`Morphology.normalize_attrs` and simplify
  * Simplify POS field detection/conversion
  * Convert all non-POS features to strings
* `Morphology` returns an empty string for a missing morph to align
with the FEATS string returned for an existing morph
* Remove unused `list_to_feats`
2020-07-24 09:28:06 +02:00
Ines Montani
87737a5a60 Tidy up 2020-07-23 00:16:23 +02:00
Ines Montani
a624ae0675 Remove POS, TAG and LEMMA from tokenizer exceptions 2020-07-22 23:09:01 +02:00
Ines Montani
14d7d46f89 Merge branch 'develop' into feature/language-data-config 2020-07-22 22:18:53 +02:00
Ines Montani
b507f61629 Tidy up and move noun_chunks, token_match, url_match 2020-07-22 22:18:46 +02:00
Ines Montani
7fc4dadd22 Fix typo 2020-07-22 20:27:22 +02:00
Ines Montani
d0c6d1efc5
@factories -> factory (#5801) 2020-07-22 17:29:31 +02:00
Ines Montani
2c5bb59909 Use consistent --gpu-id option name 2020-07-22 16:53:41 +02:00
Ines Montani
0fcd352179 Remove omit_extra_lookups 2020-07-22 16:01:17 +02:00
Ines Montani
945f795a3e WIP: move more language data to config 2020-07-22 15:59:37 +02:00
Adriane Boyd
b84fd70cc3
Fix exceptions for Morphology.__reduce__ (#5792)
Pickle exceptions in the MORPH_RULES format instead of the internal
format after the recent `Morphology.__init__` changes.
2020-07-22 15:00:25 +02:00
Ines Montani
43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
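Among the changes in the commit above (43b960c01b) is a deep merge for configs. The commit does not show how it is implemented, so the following is only a generic sketch of the technique: nested dicts are merged recursively and every other value in the override replaces the default. The name `deep_merge` and the sample keys are assumptions.

```python
import copy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `overrides` merged recursively into `defaults`."""
    merged = copy.deepcopy(defaults)  # never mutate the caller's defaults
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their contents key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists and type mismatches: the override wins.
            merged[key] = copy.deepcopy(value)
    return merged
```

Treating lists as atomic values fits the commit's design of keeping component order in a single `[nlp.pipeline]` list, separate from the per-component blocks: overriding the list changes the order without touching component settings.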
Ines Montani
311d0bde29
Merge pull request #5788 from explosion/master-tmp 2020-07-20 15:39:24 +02:00
Ines Montani
d51db72e46 Remove Python 2 marker 2020-07-20 15:01:36 +02:00
Ines Montani
644074b954 Merge branch 'develop' into master-tmp 2020-07-20 14:58:04 +02:00
Sofie Van Landeghem
c9da9605f7
Test suite clean up (#5781)
* step_through tests: skip instead of xfail

* test_empty_doc should be fixed with new Thinc version

* remove outdated test (there are other misaligned tests now)

* xfail reason

* fix test according to French exceptions

* clarified some skipped tests

* skip Ukrainian test instead of xfail

* skip instead of xfail

* skip + reason instead of xfail

* removed obsolete tests referring to removed "set_frozen" functionality

* fix test 999

* remove unused AlignmentError

* remove xfail where possible, skip otherwise

* increment thinc release for empty_doc test
2020-07-20 14:49:54 +02:00
Sofie Van Landeghem
1b2ec94382
Hyphen infix (#5770)
* infix split on hyphen when preceded by number

* clean up

* skip Ukrainian test instead of xfail
2020-07-20 14:48:51 +02:00
Adriane Boyd
ec819fc311
Provide default output for evaluate in CLI (#5784) 2020-07-20 14:42:46 +02:00
Ines Montani
cb65b36839
Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps 2020-07-19 15:15:34 +02:00
Ines Montani
fa3c98f8b3 Update train.py 2020-07-19 13:40:47 +02:00
Ines Montani
796f6c52d1 Merge branch 'develop' into pr/5767 2020-07-19 13:37:46 +02:00
Adriane Boyd
39ebcd9ec9
Refactor Chinese tokenizer configuration (#5736)
* Refactor Chinese tokenizer configuration

Refactor `ChineseTokenizer` configuration so that it uses a single
`segmenter` setting to choose between character segmentation, jieba, and
pkuseg.

* replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting
`segmenter` with the supported values: `char`, `jieba`, `pkuseg`
* make the default segmenter plain character segmentation `char` (no
additional libraries required)

* Fix Chinese serialization test to use char default

* Warn if attempting to customize other segmenter

Add a warning if `Chinese.pkuseg_update_user_dict` is called when
another segmenter is selected.
2020-07-19 13:34:37 +02:00
Adriane Boyd
9ee1c54f40
Improve tag map initialization and updating (#5764)
* Improve tag map initialization and updating

Generalize tag map initialization and updating so that the tag map can
be loaded correctly prior to loading a `Corpus` with `spacy debug-data`
and `spacy train`.

* normalize provided tag map as necessary
* use the same method for initializing and updating the tag map

* Replace rather than update tag map

Replace rather than update tag map when loading a custom tag map.
Updating the tag map is problematic due to the sorted list of tag names
and the fact that the tag map will contain lingering/unwanted tags from
the default tag map.

* Update CLI scripts

* Reinitialize cache after loading new tag map

Reinitialize the cache with the right size after loading a new tag map.
2020-07-19 13:13:57 +02:00
Adriane Boyd
597bcc629e
Improve tag map initialization and updating (#5768)
* Improve tag map initialization and updating

Generalize tag map initialization and updating so that a provided tag
map can be loaded correctly in the CLI.

* normalize provided tag map as necessary
* use the same method for initializing and overwriting the tag map

* Reinitialize cache after loading new tag map

Reinitialize the cache with the right size after loading a new tag map.
2020-07-19 11:13:39 +02:00
Adriane Boyd
b81a89f0a9
Update morphologizer (#5766)
* update `Morphologizer.begin_training` for use with `Example`

* make init and begin_training more consistent

* add `Morphology.normalize_features` to normalize outside of
`Morphology.add`

* make sure `get_loss` doesn't create unknown labels when the POS and
morph alignments differ
2020-07-19 11:10:51 +02:00
Adriane Boyd
cd5af72c9a
Update pkuseg version (#5774)
* Update pkuseg version in Chinese tokenizer warnings
* Update pkuseg version in `Makefile`
* Remove warning about python3.8 wheels in docs
2020-07-19 11:09:49 +02:00
Adriane Boyd
50db3f0cdb Serialize morph rules with tagger
Serialize `morph_rules` with the tagger alongside the `tag_map`.

Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to
load these settings rather than reinitializing the morphology each time
they are changed.
2020-07-17 08:22:21 +02:00
Adriane Boyd
d106cf66dd Update Morphology to load exceptions as MORPH_RULES
Update `Morphology` to load exceptions in `Morphology.__init__` and
`Morphology.load_morph_exceptions` from the format used in `MORPH_RULES`
rather than the internal format with tuple keys.

* Rename `Morphology.exc` to `Morphology._exc` for internal use with
tuple keys
* Add `Morphology.exc` as a property that converts the internal `_exc`
back to `MORPH_RULES` format, primarily for serialization
2020-07-16 21:16:49 +02:00
Adriane Boyd
d83e3c44c5 Remove corpus-specific morph rules
* Remove corpus-specific morph rules
* Add options similar to tag maps to provide them in the `train` and
`debug-data` CLIs
2020-07-15 19:44:18 +02:00
Adriane Boyd
2f981d5af1 Remove corpus-specific tag maps
Remove corpus-specific tag maps from the language data for languages
without custom tokenizers. For languages with custom word segmenters
that also provide tags (Japanese and Korean), the tag maps for the
custom tokenizers are kept as the default.

The default tag maps for languages without custom tokenizers are now the
default tag map from `lang/tag_map.py`, UPOS -> UPOS.
2020-07-15 15:58:29 +02:00
Adriane Boyd
5228920e2f
Clarify warning W030 for misaligned BILUO tags (#5761) 2020-07-14 14:09:48 +02:00
Adriane Boyd
a7a7e0d2a6
Add morph to morphology in Doc.from_array (#5762)
* Add morph to morphology in Doc.from_array

Add morphological analyses to morphology table in `Doc.from_array`.

* Use separate vocab in DocBin roundtrip test
2020-07-14 14:07:35 +02:00
Ines Montani
872938ec76
Merge pull request #5747 from explosion/feature/refactor-config-args 2020-07-14 00:00:22 +02:00
Sofie Van Landeghem
6f3bb6f77c
fix doc.to_utf8 on GPU (#5757) 2020-07-13 23:05:33 +02:00
Adriane Boyd
7ea2cc7650
Set version to 2.3.2 (#5756) 2020-07-13 14:55:56 +02:00
Mark Neumann
27a1cd3c63
fix meta serialization in train (#5751)
Co-authored-by: Mark Neumann <markng@allenai.org>
2020-07-12 22:06:46 +02:00
Ines Montani
ed55143c0d Merge branch 'develop' into compat/remove-object-subclass 2020-07-12 14:28:52 +02:00