spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-25 17:36:30 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Ines Montani	92f8b6959a	Fix typo	2020-09-24 13:48:41 +02:00
Adriane Boyd	5c13e0cf1b	Remove unused error	2020-09-24 13:41:55 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Ines Montani	4eb39b5c43	Fix logging	2020-09-24 11:04:35 +02:00
svlandeg	25b34bba94	throw custom error when state_type is invalid	2020-09-23 16:57:14 +02:00
Adriane Boyd	b1a7d6c528	Refactor seen token detection	2020-09-22 14:42:51 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	49e80dbcac	Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc	2020-09-22 09:45:04 +02:00
Ines Montani	81606b29bd	Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]	2020-09-22 09:31:23 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidential commit	2020-09-14 17:08:00 +02:00
Adriane Boyd	ab270364f1	Modify Token.morph to enable unsetting (#6043) Modify `Token.morph` property so that `Token.c.morph` can be reset back to an internal value of `0`. Allow setting `Token.morph` from a hash as long as the morph string is already in the `StringStore`, setting it indirectly through `Token.morph_` so that the value is added to the morphology. If the hash is not in the `StringStore`, raise an error.	2020-09-13 14:06:07 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Ines Montani	f174c7b1f3	Merge branch 'develop' into pr/6018	2020-09-04 15:54:49 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Adriane Boyd	b927893309	Merge branch 'develop' into feature/dependency-matcher-v3	2020-09-04 13:03:30 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
Ines Montani	595f9dc2e4	Make displacy color registry consistent with others This was the only registry that expected the registered objects to be dictionaries instead of functions that return something. We can still support plain dicts but we should also support functions for consistency	2020-09-03 23:05:41 +02:00
Adriane Boyd	960d9cfadc	Officially support DependencyMatcher Add official support for the `DependencyMatcher`. Redesign the pattern specification. Fix and extend operator implementations. Update API docs and add usage docs. Patterns -------- Refactor pattern structure to: ``` { "LEFT_ID": str, "REL_OP": str, "RIGHT_ID": str, "RIGHT_ATTRS": dict, } ``` The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all subsequent nodes contain all four keys. New operators ------------- Because of the way patterns are constructed from left to right, it's helpful to have `follows` operators along with `precedes` operators. Add operators for simple precedes / follows alongside immediate precedes / follows. * `.*`: precedes * `;`: immediately follows * `;*`: follows Operator fixes -------------- * `<` and `<<` do not include the node itself * Fix reversed order for all operators involving linear precedence (`.`, all sibling operators) * Linear precedence operators do not match nodes outside the same parse Additional fixes ---------------- * Use v3 Matcher API * Support `get` and `remove` * Support pickling	2020-09-02 17:45:29 +02:00
Adriane Boyd	9130094199	Prevent Tagger model init with 0 labels (#5984) * Prevent Tagger model init with 0 labels Raise an error before trying to initialize a tagger model with 0 labels. * Add dummy tagger label for test * Remove tagless tagger model initializiation * Fix error number after merge * Add dummy tagger label to test * Fix formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-31 21:24:33 +02:00
Ines Montani	add9de5487	Deprecate (Phrase)Matcher.pipe	2020-08-31 17:01:24 +02:00
Adriane Boyd	216efaf5f5	Restrict tokenizer exceptions to ORTH and NORM	2020-08-31 09:55:01 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Ines Montani	34146750d4	Use frozen list with custom errors We don't want to break backwards compatibility too much but we also want to provide the best possible UX	2020-08-29 15:20:11 +02:00
Ines Montani	10da74382f	Raise if disabled components are removed before DisabledPipes.restore	2020-08-28 20:35:26 +02:00
Ines Montani	cad988da7f	Allow component decorators to re-run with same function	2020-08-28 16:27:22 +02:00
Matthew Honnibal	2d520d3b45	Remove unused error	2020-08-26 15:41:14 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Sofie Van Landeghem	688e77562b	Train CLI script fixes (#5931) * fix dash replacement in overrides arguments * perform interpolation on training config * make sure only .spacy files are read	2020-08-18 16:06:37 +02:00
Ines Montani	db2dbc8e59	Remove unused warning	2020-08-14 14:58:03 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
svlandeg	0b4d1e1bc4	'debug data' instead of 'debug-data'	2020-08-06 15:47:31 +02:00
Ines Montani	0881455a5d	Update error message	2020-08-05 23:15:05 +02:00
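
The pattern layout described in the `DependencyMatcher` commit (960d9cfadc) can be illustrated with a small key check. This is a standard-library-only sketch, not spaCy's own validation code. The helper name `check_pattern`, the two constants, the node names `verb` and `subject`, and the example `REL_OP` and attribute values are all hypothetical choices made for illustration. Only the four key names and the rule that the first node carries just `RIGHT_ID` and `RIGHT_ATTRS` come from the commit message.

```python
from typing import Any, Dict, List

# Key sets taken from the commit message; the constant names are hypothetical.
FIRST_NODE_KEYS = {"RIGHT_ID", "RIGHT_ATTRS"}
LATER_NODE_KEYS = {"LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS"}


def check_pattern(pattern: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the layout looks right.

    Hypothetical helper: it only checks which keys each node has,
    not whether the operators or attributes are valid.
    """
    problems = []
    for index, node in enumerate(pattern):
        expected = FIRST_NODE_KEYS if index == 0 else LATER_NODE_KEYS
        found = set(node)
        if found != expected:
            problems.append(
                f"node {index}: expected keys {sorted(expected)}, got {sorted(found)}"
            )
    return problems


# Example pattern (values are assumptions chosen for illustration).
pattern = [
    # First node: only RIGHT_ID and RIGHT_ATTRS.
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # Every later node: all four keys.
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

# A first node that wrongly carries all four keys.
bad_pattern = [
    {"LEFT_ID": "x", "REL_OP": ">", "RIGHT_ID": "verb", "RIGHT_ATTRS": {}},
]

print(check_pattern(pattern))
print(check_pattern(bad_pattern))
```

The check deliberately stops at the key layout, since that is the only rule the commit message spells out. Operator semantics and attribute validation belong to the library itself.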

244 Commits