spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-26 09:56:28 +03:00

Author	SHA1	Message	Date
Adriane Boyd	af125875cf	Update SimpleNER (#5878 ) * Fix `get_loss` to use NER annotation * Add labels as part of cfg * Add simple overfitting test	2020-08-05 14:43:29 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875 ) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857 ) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Sofie Van Landeghem	34873c4911	Example Dict format consistency (#5858 ) * consistently use upper-case IDS in token_annotation format and for get_aligned * remove ID from to_dict (not used in from_dict either) * fix test Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 22:22:26 +02:00
Adriane Boyd	fa79a0db9f	Add AttributeRuler for token attribute exceptions (#5842 ) * Add AttributeRuler for token attribute exceptions Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes. Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns. There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict. * Fix default name * Update name in error message * Extend AttributeRuler functionality * Add option to initialize with a dict of AttributeRuler patterns * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?) * Improve types * Sort spans before processing * Fix index boundaries in Span * Refactor retokenizer to separate attrs methods Add top-level `normalize_token_attrs` and `set_token_attrs` methods. * Update AttributeRuler to use refactored methods Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token. 
* Update spacy/pipeline/attributeruler.py Co-authored-by: Ines Montani <ines@ines.io> * Make API more similar to EntityRuler * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts * Return list of dicts as property `AttributeRuler.patterns` * Make attrs_unnormed private * Add test loading patterns from assets * Revert "Fix index boundaries in Span" This reverts commit `8f8a5c3386`. * Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 17:02:39 +02:00
Sofie Van Landeghem	492d1ec5de	Prevent alignment when texts don't match (#5867 ) * remove empty gold.pyx * add alignment unit test (to be used in docs) * ensure that Alignment is only used on equal texts * additional test using example.alignment * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 16:29:18 +02:00
Matthew Honnibal	ecb3c4e8f4	Create corpus iterator and batcher from registry during training (#5865 ) * Move batchers into their own module (and registry) * Update CLI * Update Corpus and batcher * Update tests * Update one config * Merge 'evaluation' block back under [training] * Import batchers in gold __init__ * Fix batchers * Update config * Update schema * Update util * Don't assume train and dev are actually paths * Update onto-joint config * Fix missing import * Format * Format * Update spacy/gold/corpus.py Co-authored-by: Ines Montani <ines@ines.io> * Fix name * Update default config * Fix get_length option in batchers * Update test * Add comment * Pass path into Corpus * Update docstring * Update schema and configs * Update config * Fix test * Fix paths * Fix print * Fix create_train_batches * [training.read_train] -> [training.train_corpus] * Update onto-joint config Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 15:09:37 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872 ) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Adriane Boyd	b7e3018d97	Recalculate alignment if tokenization differs (#5868 ) * Recalculate alignment if tokenization differs * Refactor cached alignment data	2020-08-04 14:31:32 +02:00
Ines Montani	934447a611	Merge pull request #5855 from svlandeg/fix/cli-debug	2020-08-03 13:09:20 +02:00
Ines Montani	4c055f0aa7	Add init CLI and init config (#5854 ) * Add init CLI and init config draft * Improve config validation * Auto-format * Don't export anything in debug config * Update docs	2020-08-02 15:18:30 +02:00
svlandeg	6f4e46ee93	Merge remote-tracking branch 'upstream/develop' into fix/cli-debug # Conflicts: # pyproject.toml # requirements.txt # setup.cfg	2020-08-01 18:38:59 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	b68c53858c	Remove global	2020-07-31 18:37:58 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
svlandeg	9b719dfb1a	use divider in between steps	2020-07-31 18:06:48 +02:00
svlandeg	51ffc4a166	rename pipe_name to component	2020-07-31 17:58:55 +02:00
svlandeg	878327d38e	printing final predictions by default to False	2020-07-31 17:36:32 +02:00
Ines Montani	2d955fbf98	Fix linting [ci skip]	2020-07-31 17:05:28 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
svlandeg	cc2f58a1b0	use data_validation context manager	2020-07-31 16:49:42 +02:00
svlandeg	5fa3235d06	set DATA_VALIDATION to False for debug_model (upgrade thinc)	2020-07-31 15:21:01 +02:00
svlandeg	08d3c36c20	bugfix in train CLI	2020-07-31 15:03:43 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Adriane Boyd	901801b33b	Fix default arguments in DependencyParser.score	2020-07-31 10:55:44 +02:00
Adriane Boyd	9d79916792	Merge branch 'develop' into feature/scorer-adjustments	2020-07-31 10:48:14 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844 ) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00
svlandeg	0b23594953	pipe_name instead of section in debug_model	2020-07-30 20:06:28 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Matthew Honnibal	a2d573c039	Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors	2020-07-29 14:56:27 +02:00
Matthew Honnibal	2af741d7e3	Fix train arg	2020-07-29 14:56:01 +02:00
Matthew Honnibal	c27309f839	Merge branch 'develop' into feature/vectors	2020-07-29 14:54:10 +02:00
Ines Montani	62266fb828	Fix broken type annotation	2020-07-29 14:49:49 +02:00
Matthew Honnibal	142b58be92	Fix import	2020-07-29 14:45:09 +02:00
Matthew Honnibal	c99a653070	Adjust textcat model	2020-07-29 14:38:15 +02:00
Matthew Honnibal	9e1b11dd81	Update vectors in textcat	2020-07-29 14:35:36 +02:00
Matthew Honnibal	105cf29967	Fix DocBin	2020-07-29 14:23:13 +02:00
Ines Montani	ff0bc05da8	Fix docstrings [ci skip]	2020-07-29 14:09:37 +02:00
Ines Montani	6e2623d3f8	Fix docstring [ci skip]	2020-07-29 14:08:05 +02:00
Ines Montani	8d56260d92	Fix docstrings [ci skip]	2020-07-29 14:07:13 +02:00
Ines Montani	80b18124d2	Fix docstring [ci skip]	2020-07-29 14:03:35 +02:00
Matthew Honnibal	f0cf4a2dca	Update tests	2020-07-29 14:01:14 +02:00
Matthew Honnibal	07b47eaac8	Update tok2vec layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	5ae8628571	Fix CharacterEmbed layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	97d3651574	Fix stray link_vectors_to_models call	2020-07-29 14:01:13 +02:00
Matthew Honnibal	c7d1ece3eb	Update tests	2020-07-29 14:01:13 +02:00
Matthew Honnibal	00de30bcc2	Update CharacterEmbed function	2020-07-29 14:01:12 +02:00
Matthew Honnibal	6a6b09bd32	Update morphologizer model	2020-07-29 14:01:12 +02:00
Matthew Honnibal	20e9098e3f	Update tests	2020-07-29 14:01:12 +02:00
Matthew Honnibal	c35d6282fc	Add previous HashEmbedCNN tok2vec to make transition easier	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Matthew Honnibal	0c17ea4c85	Format	2020-07-29 14:00:13 +02:00
Matthew Honnibal	2aff3c4b5a	Load vectors in 'spacy train'	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7852a68a75	Fix load_vectors_into_model function	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7299419fe4	Don't load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	30dd96c540	Load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	df95e2af64	Add load_vectors_into_model util	2020-07-29 14:00:12 +02:00
Matthew Honnibal	475d7c1c7c	Fix StaticVectors class	2020-07-29 14:00:11 +02:00
Matthew Honnibal	44d350dc94	Use spaCy's StaticVectors	2020-07-29 14:00:11 +02:00
Matthew Honnibal	acc64e138a	Add import	2020-07-29 14:00:11 +02:00
Matthew Honnibal	9987ea9e4d	Fix Tok2Vec begin_training	2020-07-29 14:00:10 +02:00
Matthew Honnibal	099e9331c5	Fix tok2vec	2020-07-29 14:00:10 +02:00
Matthew Honnibal	fe0cdcd461	Fixes	2020-07-29 14:00:09 +02:00
Matthew Honnibal	123f8b832d	Refactor Tok2Vec model	2020-07-29 14:00:09 +02:00
Matthew Honnibal	c6b4f63c7c	Remove obsolete function	2020-07-29 14:00:09 +02:00
Matthew Honnibal	9cc7262224	Draft StaticVectors layer	2020-07-29 14:00:09 +02:00
Matthew Honnibal	cb9654e98c	WIP on new StaticVectors	2020-07-29 14:00:09 +02:00
Ines Montani	e257e66ab9	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
Ines Montani	e0ffe36e79	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771 ) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Adriane Boyd	191a12d75f	Fix score_weights typo in train CLI (#5835 )	2020-07-29 11:04:12 +02:00
Adriane Boyd	0cddb0dbe9	Move timing into Language.evaluate (#5836 ) Move timing into `Language.evaluate` so that only the processing is timed, not processing + scoring. `Language.evaluate` returns `scores["speed"]` as words per second, which should be identical to how the speed was added to the scores previously. Also add the speed to the evaluate CLI output.	2020-07-29 11:02:31 +02:00
Adriane Boyd	c689ae8f0a	Fix types in Scorer	2020-07-29 10:40:30 +02:00
Ines Montani	7adffc5361	Remove unused schema	2020-07-28 23:12:47 +02:00
Ines Montani	e5d9eaf79c	Tidy up docstrings and arguments	2020-07-28 23:12:42 +02:00
Ines Montani	ac24adec73	Small adjustments to Scorer and docs	2020-07-28 21:39:42 +02:00
Ines Montani	2c7a32cf12	Remove unused methods	2020-07-28 16:50:02 +02:00
Ines Montani	ba22111ff4	Move error to Errors	2020-07-28 16:24:14 +02:00
Ines Montani	2748249217	Re-add meta["pipeline"] for now	2020-07-28 16:14:23 +02:00
Ines Montani	b83ead5bf5	Merge pull request #5824 from svlandeg/fix/textcat-v3	2020-07-28 15:04:25 +02:00
Ines Montani	06a97a8766	Support --opt=value format in CLI config overrides	2020-07-28 13:43:15 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
svlandeg	85b2dcfd67	cleanup	2020-07-27 17:54:44 +02:00
svlandeg	61068e0fb1	util function dot_to_object and corresponding unit test	2020-07-27 17:50:12 +02:00
Ines Montani	10b84e1e27	Add flag to toggle sdist creation on package [ci skip]	2020-07-27 16:52:23 +02:00
Adriane Boyd	34c92dfe63	Add missing Scorer imports	2020-07-27 15:08:51 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00
Adriane Boyd	baf19fd652	Update cats scoring to provide overall score * Provide top-level score as `attr_score` * Provide a description of the score as `attr_score_desc` * Provide all potential scores keys, setting unused keys to `None` * Update CLI evaluate accordingly	2020-07-27 12:26:10 +02:00
Adriane Boyd	f8cf378be9	Combine weights from multiple components Combine weights from multiple components for the same score.	2020-07-27 10:21:31 +02:00
Ines Montani	3d56a3f286	Make more args keyword-only	2020-07-27 00:27:53 +02:00
Matthew Honnibal	80271ac0ba	Update default config	2020-07-26 15:27:39 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	53d37da29a	Make sure @factories is removed from config	2020-07-26 15:11:24 +02:00
Ines Montani	4060c2d5a6	Fix test	2020-07-26 13:40:19 +02:00
Ines Montani	2470486543	Allow pipeline components to set default scores and weights	2020-07-26 13:18:43 +02:00
Ines Montani	787d066e22	Remove pipes.pyx Probably accidentally re-added in a merge?	2020-07-26 13:08:52 +02:00
Matthew Honnibal	520d25cb50	Add smart_open dependency to fetch project assets (#5812 ) * Use smart_open for project assets * Fix assets.py * Update pyproject.toml	2020-07-26 12:15:00 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
Matthew Honnibal	71242327b2	Set version to v3.0.0a5	2020-07-25 14:06:01 +02:00
Ines Montani	cdbd6ba912	Merge pull request #5798 from explosion/feature/language-data-config	2020-07-25 13:34:49 +02:00
Ines Montani	49f27a2a7b	Tidy up [ci skip]	2020-07-25 13:00:49 +02:00
Ines Montani	4a0a692875	Add missing lex_attr_getters (resolves #5806 )	2020-07-25 12:55:18 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731 ) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. 
* Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	c003d26b94	Tidy up	2020-07-25 12:21:37 +02:00
Ines Montani	a063a82c40	Tidy up __init__.py	2020-07-25 12:14:37 +02:00
Ines Montani	8d9d28eb8b	Re-add setting for vocab data and tidy up	2020-07-25 12:14:28 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	38f6ea7a78	Simplify language data and revert detailed configs	2020-07-24 14:50:26 +02:00
Adriane Boyd	656574a01a	Update Japanese tests (#5807 ) * Update POS tests to reflect current behavior (it is not entirely clear whether the AUX/VERB mapping is indeed the desired behavior?) * Switch to `from_config` initialization in subtoken test	2020-07-24 12:45:14 +02:00
Adriane Boyd	fdb8815ef5	Minor refactor for Morphology and MorphAnalysis (#5804 ) * `MorphAnalysis.get` returns only the field values * Move `_normalize_props` inside `Morphology` as `Morphology.normalize_attrs` and simplify * Simplify POS field detection/conversion * Convert all non-POS features to strings * `Morphology` returns an empty string for a missing morph to align with the FEATS string returned for an existing morph * Remove unused `list_to_feats`	2020-07-24 09:28:06 +02:00
Ines Montani	87737a5a60	Tidy up	2020-07-23 00:16:23 +02:00
Ines Montani	a624ae0675	Remove POS, TAG and LEMMA from tokenizer exceptions	2020-07-22 23:09:01 +02:00
Ines Montani	14d7d46f89	Merge branch 'develop' into feature/language-data-config	2020-07-22 22:18:53 +02:00
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	7fc4dadd22	Fix typo	2020-07-22 20:27:22 +02:00
Ines Montani	d0c6d1efc5	@factories -> factory (#5801 )	2020-07-22 17:29:31 +02:00
Ines Montani	2c5bb59909	Use consistent --gpu-id option name	2020-07-22 16:53:41 +02:00
Ines Montani	0fcd352179	Remove omit_extra_lookups	2020-07-22 16:01:17 +02:00
Ines Montani	945f795a3e	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
Adriane Boyd	b84fd70cc3	Fix exceptions for Morphology.__reduce__ (#5792 ) Pickle exceptions in the MORPH_RULES format instead of the internal format after the recent `Morphology.__init__` changes.	2020-07-22 15:00:25 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759 ) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with 
language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	311d0bde29	Merge pull request #5788 from explosion/master-tmp	2020-07-20 15:39:24 +02:00
Ines Montani	d51db72e46	Remove Python 2 marker	2020-07-20 15:01:36 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Sofie Van Landeghem	c9da9605f7	Test suite clean up (#5781 ) * step_through tests: skip instead of xfail * test_empty_doc should be fixed with new Thinc version * remove outdated test (there are other misaligned tests now) * xfail reason * fix test according to French exceptions * clarified some skipped tests * skip Ukrainian test instead of xfail * skip instead of xfail * skip + reason instead of xfail * removed obsolete tests referring to removed "set_frozen" functionality * fix test 999 * remove unused AlignmentError * remove xfail where possible, skip otherwise * increment thinc release for empty_doc test	2020-07-20 14:49:54 +02:00
Sofie Van Landeghem	1b2ec94382	Hyphen infix (#5770 ) * infix split on hyphen when preceded by number * clean up * skip Ukrainian test instead of xfail	2020-07-20 14:48:51 +02:00
Adriane Boyd	ec819fc311	Provide default output for evaluate in CLI (#5784 )	2020-07-20 14:42:46 +02:00
Ines Montani	cb65b36839	Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps	2020-07-19 15:15:34 +02:00
Ines Montani	fa3c98f8b3	Update train.py	2020-07-19 13:40:47 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736 ) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	9ee1c54f40	Improve tag map initialization and updating (#5764 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that the tag map can be loaded correctly prior to loading a `Corpus` with `spacy debug-data` and `spacy train`. * normalize provided tag map as necessary * use the same method for initializing and updating the tag map * Replace rather than update tag map Replace rather than update tag map when loading a custom tag map. Updating the tag map is problematic due to the sorted list of tag names and the fact that the tag map will contain lingering/unwanted tags from the default tag map. * Update CLI scripts * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 13:13:57 +02:00
Adriane Boyd	597bcc629e	Improve tag map initialization and updating (#5768 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that a provided tag map can be loaded correctly in the CLI. * normalize provided tag map as necessary * use the same method for initializing and overwriting the tag map * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 11:13:39 +02:00
Adriane Boyd	b81a89f0a9	Update morphologizer (#5766 ) * update `Morphologizer.begin_training` for use with `Example` * make init and begin_training more consistent * add `Morphology.normalize_features` to normalize outside of `Morphology.add` * make sure `get_loss` doesn't create unknown labels when the POS and morph alignments differ	2020-07-19 11:10:51 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774 ) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Adriane Boyd	50db3f0cdb	Serialize morph rules with tagger Serialize `morph_rules` with the tagger alongside the `tag_map`. Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to load these settings rather than reinitializing the morphology each time they are changed.	2020-07-17 08:22:21 +02:00
Adriane Boyd	d106cf66dd	Update Morphology to load exceptions as MORPH_RULES Update `Morphology` to load exceptions in `Morphology.__init__` and `Morphology.load_morph_exceptions` from the format used in `MORPH_RULES` rather than the internal format with tuple keys. * Rename `Morphology.exc` to `Morphology._exc` for internal use with tuple keys * Add `Morphology.exc` as a property that converts the internal `_exc` back to `MORPH_RULES` format, primarily for serialization	2020-07-16 21:16:49 +02:00
Adriane Boyd	d83e3c44c5	Remove corpus-specific morph rules * Remove corpus-specific morph rules * Add options similar to tag maps to provide them in the `train` and `debug-data` CLIs	2020-07-15 19:44:18 +02:00
Adriane Boyd	2f981d5af1	Remove corpus-specific tag maps Remove corpus-specific tag maps from the language data for languages without custom tokenizers. For languages with custom word segmenters that also provide tags (Japanese and Korean), the tag maps for the custom tokenizers are kept as the default. The default tag maps for languages without custom tokenizers are now the default tag map from `lang/tag_map.py`, UPOS -> UPOS.	2020-07-15 15:58:29 +02:00
Adriane Boyd	5228920e2f	Clarify warning W030 for misaligned BILUO tags (#5761 )	2020-07-14 14:09:48 +02:00
Adriane Boyd	a7a7e0d2a6	Add morph to morphology in Doc.from_array (#5762 ) * Add morph to morphology in Doc.from_array Add morphological analyses to morphology table in `Doc.from_array`. * Use separate vocab in DocBin roundtrip test	2020-07-14 14:07:35 +02:00
Ines Montani	872938ec76	Merge pull request #5747 from explosion/feature/refactor-config-args	2020-07-14 00:00:22 +02:00
Sofie Van Landeghem	6f3bb6f77c	fix doc.to_utf8 on GPU (#5757 )	2020-07-13 23:05:33 +02:00
Adriane Boyd	7ea2cc7650	Set version to 2.3.2 (#5756 )	2020-07-13 14:55:56 +02:00
Mark Neumann	27a1cd3c63	fix meta serialization in train (#5751 ) Co-authored-by: Mark Neumann <markng@allenai.org>	2020-07-12 22:06:46 +02:00
Ines Montani	ed55143c0d	Merge branch 'develop' into compat/remove-object-subclass	2020-07-12 14:28:52 +02:00
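
The tokenization scoring described in commit 2bcceb80c4 (#5731) treats each token as a character span `(token.idx, token.idx + len(token))` and reports `token_acc` as correct tokens over gold tokens, plus precision, recall and F-score, with values no longer multiplied by 100. The following is a minimal pure-Python sketch of that arithmetic only. The helper names `token_spans` and `score_token_spans` are hypothetical and are not part of spaCy; the actual `Scorer.score_tokenization` implementation may differ in detail.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def token_spans(tokens: List[str], text: str) -> Set[Span]:
    """Locate each token in `text` and return its character span.

    Hypothetical helper: stands in for (token.idx, token.idx + len(token)).
    """
    spans = set()
    offset = 0
    for tok in tokens:
        start = text.index(tok, offset)
        end = start + len(tok)
        spans.add((start, end))
        offset = end
    return spans


def score_token_spans(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Score predicted token spans against gold spans (values in 0..1)."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        # correct tokens / gold tokens, as stated in the commit message
        "token_acc": recall,
        "token_p": precision,
        "token_r": recall,
        "token_f": f_score,
    }


text = "New York-based firm"
gold = token_spans(["New", "York", "-", "based", "firm"], text)
pred = token_spans(["New", "York-based", "firm"], text)
scores = score_token_spans(gold, pred)
print(scores)
```

In this example only "New" and "firm" have identical character spans in both tokenizations, so two of three predicted tokens and two of five gold tokens count as correct.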


7506 Commits
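
Commit 40c995b1be (#5771) adds a "greedy" option for match patterns with the values FIRST and LONGEST, and notes that LONGEST also prefers the first match when lengths are equal. The following is a rough pure-Python sketch of how overlapping matches could be reduced under those two policies. The function `filter_greedy` is hypothetical and is not spaCy's implementation; in particular, the tie-break used here for FIRST (the longer match wins when two matches share a start) is an assumption that the commit message does not confirm.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_greedy(matches: List[Match], greedy: str) -> List[Match]:
    """Keep a non-overlapping subset of `matches` according to `greedy`."""
    if greedy == "FIRST":
        # Earliest start first; assumed tie-break: longer match first.
        def key(m: Match) -> Tuple[int, int]:
            return (m[0], -(m[1] - m[0]))
    elif greedy == "LONGEST":
        # Longest match first; equal lengths prefer the earlier start.
        def key(m: Match) -> Tuple[int, int]:
            return (-(m[1] - m[0]), m[0])
    else:
        raise ValueError(f"greedy must be 'FIRST' or 'LONGEST', got {greedy!r}")

    taken = set()  # token positions already covered by a kept match
    kept = []
    for start, end in sorted(matches, key=key):
        if any(i in taken for i in range(start, end)):
            continue  # overlaps a match that was preferred earlier
        taken.update(range(start, end))
        kept.append((start, end))
    return sorted(kept)


candidates = [(0, 2), (1, 5), (5, 7), (6, 8)]
first = filter_greedy(candidates, "FIRST")
longest = filter_greedy(candidates, "LONGEST")
print(first)
print(longest)
```

With these candidates, FIRST keeps `(0, 2)` and drops the overlapping `(1, 5)`, while LONGEST keeps `(1, 5)` instead. Both policies keep `(5, 7)` over the equally long `(6, 8)` because it starts earlier.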