spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-04 02:46:40 +03:00

Adriane Boyd e962784531 Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00

* Add Lemmatizer and simplify related components
  * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables.
  * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
  * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules)
  * Remove lemmatizer from `Vocab`
  * Adjust many many tests

  Differences:
  * No default lookup lemmas
  * No special treatment of TAG in `from_array` and similar required
  * Easier to modify labels in a `Tagger`
  * No extra strings added from morphology / tag map
* Fix test
* Initial fix for Lemmatizer config/serialization
  * Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
  Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
  * Make `mode` an immutable property
  * Only enforce strict `load_lookups` for known supported modes
  * Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
  * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config.

  Additional slight refactor of lookups:
  * Add `Lookups.set_table` to set a table from a provided `Table`
  * Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
  Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
  Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
cli	Update CLI docs and evaluate command [ci skip]	2020-08-07 14:40:58 +02:00
displacy	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
gold	Add DocBin to/from_disk methods and update docs (#5892)	2020-08-07 14:30:59 +02:00
lang	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
matcher	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
ml	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
pipeline	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
tests	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
tokens	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Simplify config overrides in CLI and deserialization (#5880)	2020-08-05 23:35:09 +02:00
__main__.py	Tidy up	2020-06-22 00:45:40 +02:00
about.py	Set version to v3.0.0a5	2020-07-25 14:06:01 +02:00
attrs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
attrs.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
compat.py	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
default_config.cfg	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
errors.py	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
glossary.py	unicode -> str consistency	2020-05-24 17:20:58 +02:00
kb.pxd	Tidy up and avoid absolute spacy imports in core	2020-05-21 20:05:03 +02:00
kb.pyx	Default empty KB in EL component (#5872)	2020-08-04 14:34:09 +02:00
language.py	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
lexeme.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
lexeme.pyx	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
lookups.py	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
morphology.pxd	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
morphology.pyx	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
pipe_analysis.py	Simplify pipe analysis	2020-08-01 13:40:06 +02:00
schemas.py	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
scorer.py	Be less choosy about reporting textcat scores (#5879)	2020-08-06 16:24:13 +02:00
strings.pxd	Tidy up compiler flags and imports (#5071)	2020-03-02 11:48:10 +01:00
strings.pyx	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
structs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
tokenizer.pxd	Remove dead and/or deprecated code (#5710)	2020-07-06 13:06:25 +02:00
tokenizer.pyx	Remove unused imports	2020-08-06 19:30:47 +02:00
typedefs.pxd	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Merge pull request #5882 from explosion/feature/raise-from	2020-08-06 00:35:26 +02:00
vectors.pyx	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
vocab.pxd	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
vocab.pyx	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00