spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-15 06:09:01 +03:00

Author	SHA1	Message	Date
Adriane Boyd	5eeb25f043	Tidy up code	2021-06-28 12:08:15 +02:00
Adriane Boyd	4b0ed73ed4	Update flake8 version in reqs and CI * Update some unneeded forward refs related to flake8 checks	2021-06-28 11:29:36 +02:00
Matthew Honnibal	f9946154d9	Add SpanCategorizer component (#6747 ) * Draft spancat model * Add spancat model * Add test for extract_spans * Add extract_spans layer * Upd extract_spans * Add spancat model * Add test for spancat model * Upd spancat model * Update spancat component * Upd spancat * Update spancat model * Add quick spancat test * Import SpanCategorizer * Fix SpanCategorizer component * Import SpanGroup * Fix span extraction * Fix import * Fix import * Upd model * Update spancat models * Add scoring, update defaults * Update and add docs * Fix type * Update spacy/ml/extract_spans.py * Auto-format and fix import * Fix comment * Fix type * Fix type * Update website/docs/api/spancategorizer.md * Fix comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Better defense Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix labels list Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/extract_spans.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/pipeline/spancat.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Set annotations during update * Set annotations in spancat * fix imports in test * Update spacy/pipeline/spancat.py * replace MaxoutLogistic with LinearLogistic * fix config * various small fixes * remove set_annotations parameter in update * use our beloved tupley format with recent support for doc.spans * bugfix to allow renaming the default span_key (scores weren't showing up) * use different key in docs example * change defaults to better-working parameters from project (WIP) * register spacy.extract_spans.v1 for legacy purposes * Upd dev version so can build wheel * layers instead of architectures for smaller building blocks * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/api/spancategorizer.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Include additional scores from overrides in combined score weights * Parameterize spans key in scoring Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline. * Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer` * Adjust the default score weights to include the default key * Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score * Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead. Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`. * Update website/docs/api/scorer.md * Fix scorer for spans key containing underscore * Increment version * Add Spans to Evaluate CLI (#8439) * Add Spans to Evaluate CLI * Change to spans_key * Add spans per_type output Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix spancat GPU issues (#8455) * Fix GPU issues * Require thinc >=8.0.6 * Switch to glorot_uniform_init * Fix and test ngram suggester * Include final ngram in doc for all sizes * Fix ngrams for docs of the same length as ngram size * Handle batches of docs that result in no ngrams * Add tests Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Nirant <NirantK@users.noreply.github.com>	2021-06-24 12:35:27 +02:00
Sofie Van Landeghem	e796aab4b3	Resizable textcat (#7862 ) * implement textcat resizing for TextCatCNN * resizing textcat in-place * simplify code * ensure predictions for old textcat labels remain the same after resizing (WIP) * fix for softmax * store softmax as attr * fix ensemble weight copy and cleanup * restructure slightly * adjust documentation, update tests and quickstart templates to use latest versions * extend unit test slightly * revert unnecessary edits * fix typo * ensemble architecture won't be resizable for now * use resizable layer (WIP) * revert using resizable layer * resizable container while avoid shape inference trouble * cleanup * ensure model continues training after resizing * use fill_b parameter * use fill_defaults * resize_layer callback * format * bump thinc to 8.0.4 * bump spacy-legacy to 3.0.6	2021-06-16 11:45:00 +02:00
Sofie Van Landeghem	8729307e67	register extract_ngrams layer (#8358 ) * register extract_ngrams layer * fix import * bump spacy-legacy to 3.0.6 * revert bump (wrong PR)	2021-06-14 10:30:30 +02:00
Vito De Tullio	3672464e25	applying suggestion to avoid mypy errors (#8265 ) * applying suggestion to avoid mypy errors * sign contributor agreement	2021-06-02 19:25:30 +10:00
Sofie Van Landeghem	e9037d8fc0	make EntityLinker robust for nO=None (#7930 )	2021-05-06 18:14:47 +10:00
Adriane Boyd	36ecba224e	Set up GPU CI testing (#7293 ) * Set up CI for tests with GPU agent * Update tests for enabled GPU * Fix steps filename * Add parallel build jobs as a setting * Fix test requirements * Fix install test requirements condition * Fix pipeline models test * Reset current ops in prefer/require testing * Fix more tests * Remove separate test_models test * Fix regression 5551 * fix StaticVectors for GPU use * fix vocab tests * Fix regression test 5082 * Move azure steps to .github and reenable default pool jobs * Consolidate/rename azure steps Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-22 14:58:29 +02:00
Adriane Boyd	d2bdaa7823	Replace negative rows with 0 in StaticVectors (#7674 ) * Replace negative rows with 0 in StaticVectors Replace negative row indices with 0-vectors in `StaticVectors`. * Increase versions related to StaticVectors * Increase versions of all architctures and layers related to `StaticVectors` * Improve efficiency of 0-vector operations Parallel `spacy-legacy` PR: https://github.com/explosion/spacy-legacy/pull/5 * Update config defaults to new versions * Update docs	2021-04-22 18:04:15 +10:00
Adriane Boyd	07b41c38ae	Register CharEmbed layer (#7805 )	2021-04-19 18:39:34 +10:00
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342 ) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
svlandeg	d900c55061	consistently use registry as callable	2021-03-02 17:56:28 +01:00
René Octavio Queiroz Dias	59271e887a	fix: TransformerListener with TextCatEnsemble (#6951 ) * bug: Regression test Issue #6946 * fix: Fix issue #6946 * chore: Remove regression test	2021-02-06 13:44:51 +01:00
Matthew Honnibal	ffc371350a	Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800 )	2021-01-24 14:37:33 +11:00
Sofie Van Landeghem	c8761b0e6e	rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760 )	2021-01-19 07:37:17 +08:00
Adriane Boyd	26c34ab8b0	Fix parser resizing for cupy (#6758 )	2021-01-18 20:43:15 +01:00
Matthew Honnibal	c2a18e4fa3	Update textcat ensemble model	2021-01-19 02:53:02 +11:00
Ines Montani	a203e3dbb8	Support spacy-legacy via the registry	2021-01-15 21:42:40 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
Sofie Van Landeghem	3983bc6b1e	Fix Transformer width in TextCatEnsemble (#6431 ) * add convenience method to determine tok2vec width in a model * fix transformer tok2vec dimensions in TextCatEnsemble architecture * init function should not be nested to avoid pickle issues	2021-01-06 12:44:04 +01:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460 ) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481 ) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
Sofie Van Landeghem	de108ed3e8	Add specific error when StaticVectors can't read the vectors data (#6450 )	2020-12-09 06:16:07 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451 ) * define new architectures for the pretraining objective * add loss function as attr of the omdel * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Sofie Van Landeghem	a0c899a0ff	Fix textcat + transformer architecture (#6371 ) * add pooling to textcat TransformerListener * maybe_get_dim in case it's null	2020-11-10 20:14:47 +08:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Sofie Van Landeghem	f8a1c1afd6	avoid dropout at runtime (#6247 )	2020-10-13 14:39:59 +02:00
svlandeg	40276fd3be	update NEL docs after latest refactor	2020-10-12 11:41:27 +02:00
svlandeg	08cb085f6c	Merge remote-tracking branch 'upstream/develop' into fix/various	2020-10-09 17:01:27 +02:00
svlandeg	040c7c0541	fix get_dim calls in build_simple_cnn_text_classifier	2020-10-09 15:40:58 +02:00
svlandeg	853edace37	fix MultiHashEmbed example in documentation	2020-10-09 14:11:06 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Ines Montani	1a554bdcb1	Update docs and docstring [ci skip]	2020-10-05 21:55:27 +02:00
Ines Montani	9614e53b02	Tidy up and auto-format	2020-10-05 21:55:18 +02:00
Matthew Honnibal	e50047f1c5	Check lengths match	2020-10-05 20:02:45 +02:00
Matthew Honnibal	cdd2b79b6d	Remove deprecated MultiHashEmbed	2020-10-05 19:58:18 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	eb9ba61517	Format	2020-10-05 15:29:49 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
svlandeg	02247cccaf	Merge remote-tracking branch 'upstream/develop' into feature/small-fixes	2020-10-02 20:48:11 +02:00
Matthew Honnibal	6965cdf16d	Fix comment	2020-10-02 17:26:21 +02:00
Ines Montani	af282ae732	Fix import	2020-10-02 01:12:34 +02:00
Ines Montani	e59ecb12c0	Auto-format	2020-10-02 01:12:30 +02:00
Matthew Honnibal	75a1569908	Merge	2020-10-01 23:07:53 +02:00
Matthew Honnibal	300e5a9928	Avoid relying on NORM in default v3 models (#6176 ) * Allow CharacterEmbed to specify feature * Default to LOWER in character embed * Update tok2vec * Use LOWER, not NORM	2020-10-01 23:05:55 +02:00
Matthew Honnibal	b854bca15c	Default to LOWER in character embed	2020-10-01 22:17:58 +02:00
Matthew Honnibal	684a77870b	Allow CharacterEmbed to specify feature	2020-10-01 22:17:26 +02:00
Sofie Van Landeghem	a22215f427	Add FeatureExtractor from Thinc (#6170 ) * move featureextractor from Thinc * Update website/docs/api/architectures.md Co-authored-by: Ines Montani <ines@ines.io> * Update website/docs/api/architectures.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Ines Montani <ines@ines.io>	2020-10-01 16:22:48 +02:00
svlandeg	5121972930	add types of Tok2Vec embedding layers	2020-10-01 09:20:09 +02:00
svlandeg	5a9fdbc8ad	state_type as Literal	2020-09-23 17:32:14 +02:00
svlandeg	25b34bba94	throw custom error when state_type is invalid	2020-09-23 16:57:14 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069 ) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Ines Montani	1955aaaa20	Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip]	2020-09-09 21:46:40 +02:00
Sofie Van Landeghem	cb66ea7400	Remove simple_ner code (#6041 ) * remove simple_ner code * remove unused _biluo and _iob files	2020-09-09 16:11:27 +02:00
svlandeg	39aa740777	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-09 11:59:34 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
svlandeg	bd8f9b188b	small fixes	2020-09-08 17:24:36 +02:00
svlandeg	06ef66fd73	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-08 10:28:42 +02:00
svlandeg	c32fcdf4c9	fix typo	2020-09-04 09:10:21 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Ines Montani	091a9b522a	Remove unused variable [ci skip]	2020-08-29 13:11:26 +02:00
Matthew Honnibal	160a855246	Format	2020-08-23 21:15:12 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876 ) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Ines Montani	3a193eb8f1	Fix imports, types and default configs	2020-08-07 18:40:54 +02:00
Matthew Honnibal	b1d83fc13e	Fix imports	2020-08-07 16:55:54 +02:00
Matthew Honnibal	473504d837	Format	2020-08-07 16:49:00 +02:00
Matthew Honnibal	234c52a91e	Add tok2vec docstrings	2020-08-07 16:48:48 +02:00
Matthew Honnibal	547bc8a82b	Add docstring notes	2020-08-07 16:17:34 +02:00
Matthew Honnibal	da6e59519e	Add docstrings for simple_ner	2020-08-07 15:09:49 +02:00
Matthew Honnibal	7ef8a64df9	Add docstring for parser	2020-08-07 14:59:34 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872 ) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844 ) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00
Matthew Honnibal	142b58be92	Fix import	2020-07-29 14:45:09 +02:00
Matthew Honnibal	c99a653070	Adjust textcat model	2020-07-29 14:38:15 +02:00
Matthew Honnibal	9e1b11dd81	Update vectors in textcat	2020-07-29 14:35:36 +02:00
Matthew Honnibal	07b47eaac8	Update tok2vec layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	5ae8628571	Fix CharacterEmbed layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	00de30bcc2	Update CharacterEmbed function	2020-07-29 14:01:12 +02:00
Matthew Honnibal	c35d6282fc	Add previous HashEmbedCNN tok2vec to make transition easier	2020-07-29 14:01:12 +02:00
Matthew Honnibal	0c17ea4c85	Format	2020-07-29 14:00:13 +02:00
Matthew Honnibal	475d7c1c7c	Fix StaticVectors class	2020-07-29 14:00:11 +02:00
Matthew Honnibal	44d350dc94	Use spaCy's StaticVectors	2020-07-29 14:00:11 +02:00
Matthew Honnibal	099e9331c5	Fix tok2vec	2020-07-29 14:00:10 +02:00
Matthew Honnibal	fe0cdcd461	Fixes	2020-07-29 14:00:09 +02:00
Matthew Honnibal	123f8b832d	Refactor Tok2Vec model	2020-07-29 14:00:09 +02:00
Matthew Honnibal	c6b4f63c7c	Remove obsolete function	2020-07-29 14:00:09 +02:00
Matthew Honnibal	9cc7262224	Draft StaticVectors layer	2020-07-29 14:00:09 +02:00
Matthew Honnibal	cb9654e98c	WIP on new StaticVectors	2020-07-29 14:00:09 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00

199 Commits