spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-11 05:19:52 +03:00

Author	SHA1	Message	Date
Ines Montani	8d56260d92	Fix docstrings [ci skip]	2020-07-29 14:07:13 +02:00
Ines Montani	80b18124d2	Fix docstring [ci skip]	2020-07-29 14:03:35 +02:00
Matthew Honnibal	6a6b09bd32	Update morphologizer model	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Matthew Honnibal	9987ea9e4d	Fix Tok2Vec begin_training	2020-07-29 14:00:10 +02:00
Ines Montani	e257e66ab9	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
Ines Montani	e0ffe36e79	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771 ) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Ines Montani	2c7a32cf12	Remove unused methods	2020-07-28 16:50:02 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
Adriane Boyd	34c92dfe63	Add missing Scorer imports	2020-07-27 15:08:51 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	2470486543	Allow pipeline components to set default scores and weights	2020-07-26 13:18:43 +02:00
Ines Montani	787d066e22	Remove pipes.pyx Probably accidentally re-added in a merge?	2020-07-26 13:08:52 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
Ines Montani	cdbd6ba912	Merge pull request #5798 from explosion/feature/language-data-config	2020-07-25 13:34:49 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731 ) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. 
* Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759 ) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with 
language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Adriane Boyd	b81a89f0a9	Update morphologizer (#5766 ) * update `Morphologizer.begin_training` for use with `Example` * make init and begin_training more consistent * add `Morphology.normalize_features` to normalize outside of `Morphology.add` * make sure `get_loss` doesn't create unknown labels when the POS and morph alignments differ	2020-07-19 11:10:51 +02:00
Adriane Boyd	50db3f0cdb	Serialize morph rules with tagger Serialize `morph_rules` with the tagger alongside the `tag_map`. Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to load these settings rather than reinitializing the morphology each time they are changed.	2020-07-17 08:22:21 +02:00
Ines Montani	5f6f4ff594	Remove object subclassing	2020-07-12 14:03:23 +02:00
Sofie Van Landeghem	dd207a28be	cleanup components API (#5726 ) * add keyword separator for update functions and drop unused "state" * few more Example tests and various small fixes * consistently return losses after update call * eliminate unused tensors field across pipe components * fix name * fix arg name	2020-07-09 19:43:39 +02:00
Adriane Boyd	ad15499b3b	Fix get_loss for values outside of labels in senter (#5730 ) * Fix get_loss for None alignments in senter When converting the `sent_start` values back to `SentenceRecognizer` labels, handle `None` alignments. * Handle SENT_START as -1 Handle SENT_START as -1 (or -1 converted to uint64) by treating any values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.	2020-07-09 01:41:58 +02:00
Adriane Boyd	c9f0f75778	Update get_loss for senter and morphologizer (#5724 ) * Update get_loss for senter Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`. * Update get_loss for morphologizer Update `Morphologizer.get_loss` to keep it similar to `Tagger`.	2020-07-08 13:59:28 +02:00
Matthw Honnibal	a4164f67ca	Don't normalize gradients	2020-07-07 17:21:58 +02:00
Ines Montani	412dbb1f38	Remove dead and/or deprecated code (#5710 ) * Remove dead and/or deprecated code * Remove n_threads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-06 13:06:25 +02:00
Sofie Van Landeghem	fcbf899b08	Feature/example only (#5707 ) * remove _convert_examples * fix test_gold, raise TypeError if tuples are used instead of Examples * throwing proper errors when the wrong type of objects are passed * fix deprecated format in tests * fix deprecated format in parser tests * fix tests for NEL, morph, senter, tagger, textcat * update regression tests with new Example format * use make_doc * more fixes to nlp.update calls * few more small fixes for rehearse and evaluate * only import ml_datasets if really necessary	2020-07-06 13:02:36 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700 ) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Sofie Van Landeghem	8d3c0306e1	refactor fixes (#5664 ) * fixes in ud_train, UX for morphs * update pyproject with new version of thinc * fixes in debug_data script * cleanup of old unused error messages * remove obsolete TempErrors * move error messages to errors.py * add ENT_KB_ID to default DocBin serialization * few fixes to simple_ner * fix tags	2020-06-29 14:33:00 +02:00
Matthew Honnibal	8c29268749	Improve spacy.gold (no GoldParse, no json format!) (#5555 ) * Update errors * Remove beam for now (maybe) Remove beam_utils Update setup.py Remove beam * Remove GoldParse WIP on removing goldparse Get ArcEager compiling after GoldParse excise Update setup.py Get spacy.syntax compiling after removing GoldParse Rename NewExample -> Example and clean up Clean html files Start updating tests Update Morphologizer * fix error numbers * fix merge conflict * informative error when calling to_array with wrong field * fix error catching * fixing language and scoring tests * start testing get_aligned * additional tests for new get_aligned function * Draft create_gold_state for arc_eager oracle * Fix import * Fix import * Remove TokenAnnotation code from nonproj * fixing NER one-to-many alignment * Fix many-to-one IOB codes * fix test for misaligned * attempt to fix cases with weird spaces * fix spaces * test_gold_biluo_different_tokenization works * allow None as BILUO annotation * fixed some tests + WIP roundtrip unit test * add spaces to json output format * minibatch utiltiy can deal with strings, docs or examples * fix augment (needs further testing) * various fixes in scripts - needs to be further tested * fix test_cli * cleanup * correct silly typo * add support for MORPH in to/from_array, fix morphologizer overfitting test * fix tagger * fix entity linker * ensure test keeps working with non-linked entities * pipe() takes docs, not examples * small bug fix * textcat bugfix * throw informative error when running the components with the wrong type of objects * fix parser tests to work with example (most still failing) * fix BiluoPushDown parsing entities * small fixes * bugfix tok2vec * fix renames and simple_ner labels * various small fixes * prevent writing dummy values like deps because that could interfer with sent_start values * fix the fix * implement split_sent with aligned SENT_START attribute * test for split sentences with various 
alignment issues, works * Return ArcEagerGoldParse from ArcEager * Update parser and NER gold stuff * Draft new GoldCorpus class * add links to to_dict * clean up * fix test checking for variants * Fix oracles * Start updating converters * Move converters under spacy.gold * Move things around * Fix naming * Fix name * Update converter to produce DocBin * Update converters * Allow DocBin to take list of Doc objects. * Make spacy convert output docbin * Fix import * Fix docbin * Fix compile in ArcEager * Fix import * Serialize all attrs by default * Update converter * Remove jsonl converter * Add json2docs converter * Draft Corpus class for DocBin * Work on train script * Update Corpus * Update DocBin * Allocate Doc before starting to add words * Make doc.from_array several times faster * Update train.py * Fix Corpus * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests * Skip tests that cause crashes * Skip test causing segfault * Remove GoldCorpus * Update imports * Update after removing GoldCorpus * Fix module name of corpus * Fix mimport * Work on parser oracle * Update arc_eager oracle * Restore ArcEager.get_cost function * Update transition system * Update test_arc_eager_oracle * Remove beam test * Update test * Unskip * Unskip tests * add links to to_dict * clean up * fix test checking for variants * Allow DocBin to take list of Doc objects. 
* Fix compile in ArcEager * Serialize all attrs by default Move converters under spacy.gold Move things around Fix naming Fix name Update converter to produce DocBin Update converters Make spacy convert output docbin Fix import Fix docbin Fix import Update converter Remove jsonl converter Add json2docs converter * Allocate Doc before starting to add words * Make doc.from_array several times faster * Start updating converters * Work on train script * Draft Corpus class for DocBin Update Corpus Fix Corpus * Update DocBin Add missing strings when serializing * Update train.py * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests Skip tests that cause crashes Skip test causing segfault * Remove GoldCorpus Update imports Update after removing GoldCorpus Fix module name of corpus Fix mimport * Work on parser oracle Update arc_eager oracle Restore ArcEager.get_cost function Update transition system * Update tests Remove beam test Update test Unskip Unskip tests * Add get_aligned_parse method in Example Fix Example.get_aligned_parse * Add kwargs to Corpus.dev_dataset to match train_dataset * Update nonproj * Use get_aligned_parse in ArcEager * Add another arc-eager oracle test * Remove Example.doc property Remove Example.doc Remove Example.doc Remove Example.doc Remove Example.doc * Update ArcEager oracle Fix Break oracle * Debugging * Fix Corpus * Fix eg.doc * Format * small fixes * limit arg for Corpus * fix test_roundtrip_docs_to_docbin * fix test_make_orth_variants * fix add_label test * Update tests * avoid writing temp dir in json2docs, fixing 4402 test * Update test * Add missing costs to NER oracle * Update test * Work on Example.get_aligned_ner method * Clean up debugging * Xfail tests * Remove prints * Remove print * Xfail some tests * Replace unseen labels for parser * Update test * Update test * Xfail test * Fix Corpus * fix imports * fix docs_to_json * various small fixes * cleanup * Support 
gold_preproc in Corpus * Support gold_preproc * Pass gold_preproc setting into corpus * Remove debugging * Fix gold_preproc * Fix json2docs converter * Fix convert command * Fix flake8 * Fix import * fix output_dir (converted to Path by typer) * fix var * bugfix: update states after creating golds to avoid out of bounds indexing * Improve efficiency of ArEager oracle * pull merge_sent into iob2docs to avoid Doc creation for each line * fix asserts * bugfix excl Span.end in iob2docs * Support max_length in Corpus * Fix arc_eager oracle * Filter out uannotated sentences in NER * Remove debugging in parser * Simplify NER alignment * Fix conversion of NER data * Fix NER init_gold_batch * Tweak efficiency of precomputable affine * Update onto-json default * Update gold test for NER * Fix parser test * Update test * Add NER data test * Fix convert for single file * Fix test * Hack scorer to avoid evaluating non-nered data * Fix handling of NER data in Example * Output unlabelled spans from O biluo tags in iob_utils * Fix unset variable * Return kept examples from init_gold_batch * Return examples from init_gold_batch * Dont return Example from init_gold_batch * Set spaces on gold doc after conversion * Add test * Fix spaces reading * Improve NER alignment * Improve handling of missing values in NER * Restore the 'cutting' in parser training * Add assertion * Print epochs * Restore random cuts in parser/ner training * Implement Doc.copy * Implement Example.copy * Copy examples at the start of Language.update * Don't unset example docs * Tweak parser model slightly * attempt to fix _guess_spaces * _add_entities_to_doc first, so that links don't get overwritten * fixing get_aligned_ner for one-to-many * fix indexing into x_text * small fix biluo_tags_from_offsets * Add onto-ner config * Simplify NER alignment * Fix NER scoring for partially annotated documents * fix indexing into x_text * fix test_cli failing tests by ignoring spans in doc.ents with empty label * Fix limit 
* Improve NER alignment * Fix count_train * Remove print statement * fix tests, we're not having nothing but None * fix clumsy fingers * Fix tests * Fix doc.ents * Remove empty docs in Corpus and improve limit * Update config Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-06-26 19:34:12 +02:00
Adriane Boyd	b7107ac89f	Disregard special tag _SP in check for new tag map (#5641 ) * Skip special tag _SP in check for new tag map In `Tagger.begin_training()` check for new tags aside from `_SP` in the new tag map initialized from the provided gold tuples when determining whether to reinitialize the morphology with the new tag map. * Simplify _SP check	2020-06-26 09:23:21 +02:00
svlandeg	2f6062a8a4	add line that got removed from EntityLinker	2020-06-20 23:14:45 +02:00
svlandeg	12dc8ab208	remove redundant code from master in EntityLinker	2020-06-20 23:07:42 +02:00
svlandeg	256d4c27c8	fix tagger begin_training being called without examples	2020-06-20 22:38:00 +02:00
svlandeg	c9242e9bf4	fix entity linker (cf PR #5548 )	2020-06-20 21:47:23 +02:00
Ines Montani	0cdb631e6c	Fix merge errors	2020-06-20 16:02:42 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Ines Montani	8283df80e9	Tidy up and auto-format	2020-06-20 14:15:04 +02:00
Adriane Boyd	c482f20778	Fix and add warnings related to spacy-lookups-data (#5588 ) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:56:04 +02:00
theudas	fa46e0bef2	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 02:03:23 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575 ) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e0f9f448f1	remove Tensorizer	2020-06-01 23:38:48 +02:00
Adriane Boyd	a005ccd6d7	Preserve _SP when filtering tag map in Tagger To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if present as the reference `SPACE` POS in the tag map in `Tagger.begin_training()`.	2020-05-31 19:57:54 +02:00
Ines Montani	1a15896ba9	unicode -> str consistency [ci skip]	2020-05-24 18:51:10 +02:00
Ines Montani	5d3806e059	unicode -> str consistency	2020-05-24 17:20:58 +02:00
Matthew Honnibal	93c4d13588	Merge pull request #5264 from lfiedler/issue-5230 Fix ResourceWarnings during unittest	2020-05-22 00:31:07 +02:00
Matthw Honnibal	bc94fdabd0	Fix begin_training	2020-05-21 20:46:21 +02:00
Matthw Honnibal	f075655deb	Fix shape inference in begin_training	2020-05-21 19:26:29 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Sofie Van Landeghem	7f5715a081	Various fixes to NEL functionality, Example class etc (#5460 ) * setting KB in the EL constructor, similar to how the model is passed on * removing wikipedia example files - moved to projects * throw an error when nlp.update is called with 2 positional arguments * rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config * update config files with new parameters * avoid training pipeline components that don't have a model (like sentencizer) * various small fixes + UX improvements * small fixes * set thinc to 8.0.0a9 everywhere * remove outdated comment	2020-05-20 11:41:12 +02:00
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452 ) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449 ) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initializre in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
Matthew Honnibal	b2ef6100af	Only run backprop once when shared tok2vec weights (#5331 ) Previously, pipelines with shared tok2vec weights would call the tok2vec backprop callback multiple times, once for each pipeline component. This caused errors for PyTorch, and was inefficient. Instead, accumulate the gradient for all but one component, and just call the callback once.	2020-04-21 19:30:41 +02:00
Leander Fiedler	a3401b1194	issue5230 changed reference to function to anonymous function	2020-04-15 21:52:52 +02:00
Leander Fiedler	cef0c909b9	issue5230 changed reference to function to anonymous function	2020-04-15 19:28:33 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263 ) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
Leander Fiedler	71cc903d65	issue5230: replaced open statements on path objects so that serialization still works and files are closed	2020-04-06 20:30:41 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
Sofie Van Landeghem	ab59f3124e	fix NEL overfitting test for GPU (#5236 )	2020-04-02 10:32:52 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218 ) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
Sofie Van Landeghem	5847be6022	Tok2Vec: extract-embed-encode (#5102 ) * avoid changing original config * fix elif structure, batch with just int crashes otherwise * tok2vec example with doc2feats, encode and embed architectures * further clean up MultiHashEmbed * further generalize Tok2Vec to work with extract-embed-encode parts * avoid initializing the charembed layer with Docs (for now ?) * small fixes for bilstm config (still does not run) * rename to core layer * move new configs * walk model to set nI instead of using core ref * fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)	2020-03-08 13:23:18 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109 ) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
Sofie Van Landeghem	6ac9fc0619	Unit test for NEL functionality (#5114 ) * empty begin_training for sentencizer * overfitting unit test for entity linker * fixed NEL IO by storing the entity_vector_length in the cfg	2020-03-06 14:42:23 +01:00
Ines Montani	b0cfab317f	Merge branch 'develop' into refactor/simplify-warnings	2020-03-04 16:38:55 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00
Ines Montani	648f61d077	Tidy up compiler flags and imports (#5071 )	2020-03-02 11:48:10 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morpohologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Kabir Khan	f6ed07b85c	Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set * Run make_doc optimistically if using phrase matcher patterns. * remove unused coveragerc I was testing with * format * Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially. * Removing old add_patterns function * Fixing spacing * Make sure token_patterns loaded as well, before generator was being emptied in from_disk	2020-02-16 18:17:47 +01:00
Sofie Van Landeghem	72c964bcf4	define pretrained_dims which is used by build_text_classifier (#5004 )	2020-02-16 17:21:17 +01:00
Sofie Van Landeghem	cabd60fa1e	Small fixes to as_example (#4957 ) * label in span not writable anymore * Revert "label in span not writable anymore" This reverts commit `ab442338c8`. * fixing yield - remove redundant list	2020-02-03 13:02:12 +01:00
Sofie Van Landeghem	569cc98982	Update spaCy for thinc 8.0.0 (#4920 ) * Add load_from_config function * Add train_from_config script * Merge configs and expose via spacy.config * Fix script * Suggest create_evaluation_callback * Hard-code for NER * Fix errors * Register command * Add TODO * Update train-from-config todos * Fix imports * Allow delayed setting of parser model nr_class * Get train-from-config working * Tidy up and fix scores and printing * Hide traceback if cancelled * Fix weighted score formatting * Fix score formatting * Make output_path optional * Add Tok2Vec component * Tidy up and add tok2vec_tensors * Add option to copy docs in nlp.update * Copy docs in nlp.update * Adjust nlp.update() for set_annotations * Don't shuffle pipes in nlp.update, decruft * Support set_annotations arg in component update * Support set_annotations in parser update * Add get_gradients method * Add get_gradients to parser * Update errors.py * Fix problems caused by merge * Add _link_components method in nlp * Add concept of 'listeners' and ControlledModel * Support optional attributes arg in ControlledModel * Try having tok2vec component in pipeline * Fix tok2vec component * Fix config * Fix tok2vec * Update for Example * Update for Example * Update config * Add eg2doc util * Update and add schemas/types * Update schemas * Fix nlp.update * Fix tagger * Remove hacks from train-from-config * Remove hard-coded config str * Calculate loss in tok2vec component * Tidy up and use function signatures instead of models * Support union types for registry models * Minor cleaning in Language.update * Make ControlledModel specifically Tok2VecListener * Fix train_from_config * Fix tok2vec * Tidy up * Add function for bilstm tok2vec * Fix type * Fix syntax * Fix pytorch optimizer * Add example configs * Update for thinc describe changes * Update for Thinc changes * Update for dropout/sgd changes * Update for dropout/sgd changes * Unhack gradient update * Work on refactoring _ml * Remove 
_ml.py module * WIP upgrade cli scripts for thinc * Move some _ml stuff to util * Import link_vectors from util * Update train_from_config * Import from util * Import from util * Temporarily add ml.component_models module * Move ml methods * Move typedefs * Update load vectors * Update gitignore * Move imports * Add PrecomputableAffine * Fix imports * Fix imports * Fix imports * Fix missing imports * Update CLI scripts * Update spacy.language * Add stubs for building the models * Update model definition * Update create_default_optimizer * Fix import * Fix comment * Update imports in tests * Update imports in spacy.cli * Fix import * fix obsolete thinc imports * update srsly pin * from thinc to ml_datasets for example data such as imdb * update ml_datasets pin * using STATE.vectors * small fix * fix Sentencizer.pipe * black formatting * rename Affine to Linear as in thinc * set validate explicitely to True * rename with_square_sequences to with_list2padded * rename with_flatten to with_list2array * chaining layernorm * small fixes * revert Optimizer import * build_nel_encoder with new thinc style * fixes using model's get and set methods * Tok2Vec in component models, various fixes * fix up legacy tok2vec code * add model initialize calls * add in build_tagger_model * small fixes * setting model dims * fixes for ParserModel * various small fixes * initialize thinc Models * fixes * consistent naming of window_size * fixes, removing set_dropout * work around Iterable issue * remove legacy tok2vec * util fix * fix forward function of tok2vec listener * more fixes * trying to fix PrecomputableAffine (not succesful yet) * alloc instead of allocate * add morphologizer * rename residual * rename fixes * Fix predict function * Update parser and parser model * fixing few more tests * Fix precomputable affine * Update component model * Update parser model * Move backprop padding to own function, for test * Update test * Fix p. 
affine * Update NEL * build_bow_text_classifier and extract_ngrams * Fix parser init * Fix test add label * add build_simple_cnn_text_classifier * Fix parser init * Set gpu off by default in example * Fix tok2vec listener * Fix parser model * Small fixes * small fix for PyTorchLSTM parameters * revert my_compounding hack (iterable fixed now) * fix biLSTM * Fix uniqued * PyTorchRNNWrapper fix * small fixes * use helper function to calculate cosine loss * small fixes for build_simple_cnn_text_classifier * putting dropout default at 0.0 to ensure the layer gets built * using thinc util's set_dropout_rate * moving layer normalization inside of maxout definition to optimize dropout * temp debugging in NEL * fixed NEL model by using init defaults ! * fixing after set_dropout_rate refactor * proper fix * fix test_update_doc after refactoring optimizers in thinc * Add CharacterEmbed layer * Construct tagger Model * Add missing import * Remove unused stuff * Work on textcat * fix test (again :)) after optimizer refactor * fixes to allow reading Tagger from_disk without overwriting dimensions * don't build the tok2vec prematuraly * fix CharachterEmbed init * CharacterEmbed fixes * Fix CharacterEmbed architecture * fix imports * renames from latest thinc update * one more rename * add initialize calls where appropriate * fix parser initialization * Update Thinc version * Fix errors, auto-format and tidy up imports * Fix validation * fix if bias is cupy array * revert for now * ensure it's a numpy array before running bp in ParserStepModel * no reason to call require_gpu twice * use CupyOps.to_numpy instead of cupy directly * fix initialize of ParserModel * remove unnecessary import * fixes for CosineDistance * fix device renaming * use refactored loss functions (Thinc PR 251) * overfitting test for tagger * experimental settings for the tagger: avoid zero-init and subword normalization * clean up tagger overfitting test * use previous default value for nP * remove toy config 
* bringing layernorm back (had a bug - fixed in thinc) * revert setting nP explicitly * remove setting default in constructor * restore values as they used to be * add overfitting test for NER * add overfitting test for dep parser * add overfitting test for textcat * fixing init for linear (previously affine) * larger eps window for textcat * ensure doc is not None * Require newer thinc * Make float check vaguer * Slop the textcat overfit test more * Fix textcat test * Fix exclusive classes for textcat * fix after renaming of alloc methods * fixing renames and mandatory arguments (staticvectors WIP) * upgrade to thinc==8.0.0.dev3 * refer to vocab.vectors directly instead of its name * rename alpha to learn_rate * adding hashembed and staticvectors dropout * upgrade to thinc 8.0.0.dev4 * add name back to avoid warning W020 * thinc dev4 * update srsly * using thinc 8.0.0a0 ! Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2020-01-29 17:06:46 +01:00
adrianeboyd	a938566b62	Fix Sentencizer.pipe() for empty doc (#4940 )	2020-01-28 11:36:49 +01:00
adrianeboyd	199d89943e	Add as_example to Sentencizer pipe() (#4933 )	2020-01-22 15:40:31 +01:00
Kabir Khan	b9afcd56e3	Fix ent_ids and labels properties when id attribute used in patterns (#4900 ) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set	2020-01-16 02:01:31 +01:00
Sofie Van Landeghem	7b96a5e10f	Reduce mem usage in training Entity Linker (#4811 ) * move nlp processing for el pipe to batch training instead of preprocessing * adding dev eval back in, and limit in articles instead of entities * use pipe whenever possible * few more small doc changes * access dev data through generator * tqdm description * small fixes * update documentation	2020-01-06 14:59:50 +01:00
Ines Montani	a892821c51	More formatting changes	2019-12-25 17:59:52 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	947dba7141	Merge branch 'master' into develop	2019-12-21 19:04:43 +01:00
Ines Montani	cb4145adc7	Tidy up and auto-format	2019-12-21 19:04:17 +01:00
Ines Montani	158b98a3ef	Merge branch 'master' into develop	2019-12-21 18:55:03 +01:00
Sofie Van Landeghem	557dcf5659	NEL requires sentences to be set (#4801 )	2019-12-13 15:55:18 +01:00

287 Commits