spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-23 02:21:59 +03:00

Author	SHA1	Message	Date
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. 
Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Ines Montani	1d01d89b79	Update CLI docs and evaluate command [ci skip]	2020-08-07 14:40:58 +02:00
Ines Montani	ef2c67cca5	Add DocBin to/from_disk methods and update docs (#5892) * Add DocBin to/from_disk methods and update docs * Use DocBin.from_disk in Corpus	2020-08-07 14:30:59 +02:00
Ines Montani	4ca08c6d5d	Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api Add AttributeRuler API docs	2020-08-07 13:55:12 +02:00
Adriane Boyd	b8d0c23857	Add AttributeRuler API docs With additional minor updates to AttributeRuler docstrings.	2020-08-07 12:43:23 +02:00
svlandeg	b17db0e994	Merge remote-tracking branch 'upstream/develop' into feature/el-docs # Conflicts: # website/docs/usage/training.md	2020-08-06 19:48:52 +02:00
Adriane Boyd	06c3a5e048	Add pipe to AttributeRuler (#5889)	2020-08-06 19:43:09 +02:00
Ines Montani	9b7f198390	Fix format	2020-08-06 19:30:53 +02:00
Ines Montani	3c4389110d	Remove unused imports	2020-08-06 19:30:47 +02:00
Matthew Honnibal	d4525816ef	Be less choosy about reporting textcat scores (#5879) * Set textcat scores more consistently * Refactor textcat scores * Fixes to scorer * Add comments * Add threshold * Rename just 'f' to micro_f in textcat scorer * Fix textcat score for two-class * Fix syntax * Fix textcat score * Fix docstring	2020-08-06 16:24:13 +02:00
svlandeg	0b4d1e1bc4	'debug data' instead of 'debug-data'	2020-08-06 15:47:31 +02:00
svlandeg	881e3f8fd0	add docbin explanation and example	2020-08-06 15:29:44 +02:00
Adriane Boyd	5e683a6e46	Fix return values for per feat score (#5885) * Fix return values for per feat score Convert `PRFScore` to dict as other per type scores. * Update tests accordingly	2020-08-06 15:14:47 +02:00
Ines Montani	913d21f0a3	Merge pull request #5882 from explosion/feature/raise-from Use "raise ... from" in custom errors for better tracebacks	2020-08-06 00:35:26 +02:00
Ines Montani	06e80d95cd	Sync develop with nightly docs state (#5883) Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-08-06 00:28:14 +02:00
Ines Montani	d92954ac1d	Merge pull request #5881 from explosion/feature/better-error-model-shortcuts	2020-08-06 00:13:35 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880)	2020-08-05 23:35:09 +02:00
Ines Montani	0881455a5d	Update error message	2020-08-05 23:15:05 +02:00
Ines Montani	2a1fa86a0d	Add better error for failed model shortcut loading	2020-08-05 23:10:29 +02:00
Ines Montani	c675746ca2	Update docstrings and types	2020-08-05 20:29:46 +02:00
Ines Montani	823e533dc1	Add config callbacks for modifying nlp object before and after init (#5866) * WIP: Concept for modifying nlp object before and after init * Make callbacks return nlp object Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Raise if callbacks don't return correct type * Rename, update types, add after_pipeline_creation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-05 19:47:54 +02:00
Ines Montani	586d695775	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-05 16:01:11 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Matthew Honnibal	50c0e49741	Fix train CLI	2020-08-05 15:40:47 +02:00
Matthew Honnibal	b9df4d6116	Fix textcat.begin_training if vectors set	2020-08-05 15:40:36 +02:00
Adriane Boyd	af125875cf	Update SimpleNER (#5878) * Fix `get_loss` to use NER annotation * Add labels as part of cfg * Add simple overfitting test	2020-08-05 14:43:29 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Sofie Van Landeghem	34873c4911	Example Dict format consistency (#5858) * consistently use upper-case IDS in token_annotation format and for get_aligned * remove ID from to_dict (not used in from_dict either) * fix test Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 22:22:26 +02:00
Adriane Boyd	fa79a0db9f	Add AttributeRuler for token attribute exceptions (#5842) * Add AttributeRuler for token attribute exceptions Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes. Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns. There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict. * Fix default name * Update name in error message * Extend AttributeRuler functionality * Add option to initialize with a dict of AttributeRuler patterns * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?) * Improve types * Sort spans before processing * Fix index boundaries in Span * Refactor retokenizer to separate attrs methods Add top-level `normalize_token_attrs` and `set_token_attrs` methods. * Update AttributeRuler to use refactored methods Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token. 
* Update spacy/pipeline/attributeruler.py Co-authored-by: Ines Montani <ines@ines.io> * Make API more similar to EntityRuler * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts * Return list of dicts as property `AttributeRuler.patterns` * Make attrs_unnormed private * Add test loading patterns from assets * Revert "Fix index boundaries in Span" This reverts commit `8f8a5c3386`. * Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 17:02:39 +02:00
Sofie Van Landeghem	492d1ec5de	Prevent alignment when texts don't match (#5867) * remove empty gold.pyx * add alignment unit test (to be used in docs) * ensure that Alignment is only used on equal texts * additional test using example.alignment * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 16:29:18 +02:00
Matthew Honnibal	ecb3c4e8f4	Create corpus iterator and batcher from registry during training (#5865) * Move batchers into their own module (and registry) * Update CLI * Update Corpus and batcher * Update tests * Update one config * Merge 'evaluation' block back under [training] * Import batchers in gold __init__ * Fix batchers * Update config * Update schema * Update util * Don't assume train and dev are actually paths * Update onto-joint config * Fix missing import * Format * Format * Update spacy/gold/corpus.py Co-authored-by: Ines Montani <ines@ines.io> * Fix name * Update default config * Fix get_length option in batchers * Update test * Add comment * Pass path into Corpus * Update docstring * Update schema and configs * Update config * Fix test * Fix paths * Fix print * Fix create_train_batches * [training.read_train] -> [training.train_corpus] * Update onto-joint config Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 15:09:37 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Adriane Boyd	b7e3018d97	Recalculate alignment if tokenization differs (#5868) * Recalculate alignment if tokenization differs * Refactor cached alignment data	2020-08-04 14:31:32 +02:00
Ines Montani	934447a611	Merge pull request #5855 from svlandeg/fix/cli-debug	2020-08-03 13:09:20 +02:00
Ines Montani	4c055f0aa7	Add init CLI and init config (#5854) * Add init CLI and init config draft * Improve config validation * Auto-format * Don't export anything in debug config * Update docs	2020-08-02 15:18:30 +02:00
svlandeg	6f4e46ee93	Merge remote-tracking branch 'upstream/develop' into fix/cli-debug # Conflicts: # pyproject.toml # requirements.txt # setup.cfg	2020-08-01 18:38:59 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	b68c53858c	Remove global	2020-07-31 18:37:58 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
svlandeg	9b719dfb1a	use divider inbetween steps	2020-07-31 18:06:48 +02:00
svlandeg	51ffc4a166	rename pipe_name to component	2020-07-31 17:58:55 +02:00
svlandeg	878327d38e	printing final predictions by default to False	2020-07-31 17:36:32 +02:00
Ines Montani	2d955fbf98	Fix linting [ci skip]	2020-07-31 17:05:28 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
svlandeg	cc2f58a1b0	use data_validation context manager	2020-07-31 16:49:42 +02:00
svlandeg	5fa3235d06	set DATA_VALIDATION to False for debug_model (upgrade thinc)	2020-07-31 15:21:01 +02:00
svlandeg	08d3c36c20	bugfix in train CLI	2020-07-31 15:03:43 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
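
Commits 56c17973aa and 913d21f0a3 above switch custom errors to `raise ... from` for better tracebacks. As a rough illustration of what that Python idiom does (not spaCy's actual code: `ModelLoadError`, `load_model`, `cause_of` and the dict-based `registry` are hypothetical names made up for this sketch):

```python
class ModelLoadError(Exception):
    """Hypothetical custom error, standing in for a library-specific one."""


def load_model(name, registry):
    # `registry` is a plain dict in this sketch; a missing key raises KeyError.
    try:
        return registry[name]
    except KeyError as err:
        # `from err` stores the original exception on `__cause__`, so the
        # traceback reports it as the direct cause of the custom error.
        raise ModelLoadError(f"Can't find model '{name}'") from err


def cause_of(name, registry):
    """Return the chained cause of a failed load, or None if loading worked."""
    try:
        load_model(name, registry)
    except ModelLoadError as err:
        return err.__cause__
    return None
```

With `from err`, the traceback links the two exceptions with "The above exception was the direct cause of the following exception" instead of the implicit "During handling of the above exception, another exception occurred".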


7432 Commits
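
Commit fa79a0db9f describes the `AttributeRuler` as applying attributes to the token at a provided index in a matched span, where a negative index counts from the end of the span. The indexing rule alone might be sketched in plain Python as follows; this is an assumption-laden illustration, not the spaCy implementation, and `apply_attrs` plus the list-of-dicts token representation are hypothetical:

```python
def apply_attrs(tokens, match_start, match_end, index, attrs):
    """Set `attrs` on the token at `index` within tokens[match_start:match_end].

    `tokens` is a list of dicts standing in for token objects (hypothetical).
    A negative `index` counts from the end of the matched span.
    """
    span_len = match_end - match_start
    if not -span_len <= index < span_len:
        raise IndexError(
            f"index {index} out of range for span of length {span_len}"
        )
    # Resolve the span-relative index to an offset inside the span.
    offset = index if index >= 0 else span_len + index
    tokens[match_start + offset].update(attrs)
    return tokens


tokens = [{"text": t} for t in ["I", "saw", "the", "saw"]]
# Matched span covers "the saw"; index -1 targets its last token.
apply_attrs(tokens, 2, 4, -1, {"POS": "NOUN"})
```

Out-of-range indices raise `IndexError` here, loosely mirroring the Span index boundary checks mentioned in the same commit.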