spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-14 02:02:20 +03:00

Author	SHA1	Message	Date
Ines Montani	25d6ed3fb8	Merge pull request #5489 from explosion/feature/connected-components	2020-05-22 17:40:11 +02:00
Ines Montani	841c05b47b	Merge pull request #5490 from explosion/fix/remove-jsonschema	2020-05-22 17:39:54 +02:00
Ines Montani	569a65b60e	Auto-format	2020-05-22 16:55:42 +02:00
Ines Montani	d844528c5f	Add test for is_compatible_model	2020-05-22 16:55:15 +02:00
Ines Montani	12b7be1d98	Remove jsonschema from dependencies	2020-05-22 16:49:26 +02:00
Matthew Honnibal	f7f6df7275	Move to spacy.analysis	2020-05-22 16:43:18 +02:00
Matthew Honnibal	78d79d94ce	Guess set_annotations=True in nlp.update During `nlp.update`, components can be passed a boolean set_annotations to indicate whether they should assign annotations to the `Doc`. This needs to be called if downstream components expect to use the annotations during training, e.g. if we wanted to use tagger features in the parser. Components can specify their assignments and requirements, so we can figure out which components have these inter-dependencies. After figuring this out, we can guess whether to pass set_annotations=True. We could also call set_annotations=True always, or even just have this as the only behaviour. The downside of this is that it would require the `Doc` objects to be created afresh to avoid problematic modifications. One approach would be to make a fresh copy of the `Doc` objects within `nlp.update()`, so that we can write to the objects without any problems. If we do that, we can drop this logic and also drop the `set_annotations` mechanism. I would be fine with that approach, although it runs the risk of introducing some performance overhead, and we'll have to take care to copy all extension attributes etc.	2020-05-22 15:55:45 +02:00
Adriane Boyd	4b229bfc22	Improve handling of NER in CoNLL-U MISC	2020-05-20 18:48:51 +02:00
Sofie Van Landeghem	7f5715a081	Various fixes to NEL functionality, Example class etc (#5460 ) * setting KB in the EL constructor, similar to how the model is passed on * removing wikipedia example files - moved to projects * throw an error when nlp.update is called with 2 positional arguments * rewriting the config logic in create pipe to accommodate other objects (e.g. KB) in the config * update config files with new parameters * avoid training pipeline components that don't have a model (like sentencizer) * various small fixes + UX improvements * small fixes * set thinc to 8.0.0a9 everywhere * remove outdated comment	2020-05-20 11:41:12 +02:00
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452 ) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449 ) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initialize in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
svlandeg	047f3d7d94	remove ops argument for Adam	2020-05-15 13:25:00 +02:00
Sofie Van Landeghem	b2e93be867	Optimizer defaults (#5244 ) * set optimizer defaults to mimic thinc 7 + bump to dev6 * larger error range for senter overfitting test	2020-04-03 13:02:46 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
adrianeboyd	ce0e538068	Check whether doc is instantiated in Example.get_gold_parses() (#5167 ) * Check whether doc is instantiated When creating docs to pair with gold parses, modify test to check whether a doc is unset rather than whether it contains tokens. * Restore test of evaluate on an empty doc * Set a minimal gold.orig for the scorer Without a minimal gold.orig the scorer can't evaluate empty docs. This is the v3 equivalent of #4925.	2020-03-29 13:57:00 +02:00
Sofie Van Landeghem	d6d95674c1	bugfix in span similarity (#5155 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting	2020-03-29 13:56:07 +02:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218 ) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	92b9b631ef	xfail -> skip	2020-03-27 10:51:32 +01:00
Ines Montani	ee4bb0e3b6	Fix import	2020-03-26 21:44:18 +01:00
Ines Montani	4fe2299586	xfail hanging test	2020-03-26 20:58:13 +01:00
Ines Montani	f12a46472c	Remove unicode declarations	2020-03-26 15:18:32 +01:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205 ) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
svlandeg	59000ee21d	fix serialization of empty doc + unit test	2020-03-13 16:07:56 +01:00
Adriane Boyd	423849f94a	Fix sents comparison in test util Due to changes to `Span` (#5005), spans from different documents are now never equal. Check `Token.is_sent_start` values instead.	2020-03-13 09:25:23 +01:00
Sofie Van Landeghem	5847be6022	Tok2Vec: extract-embed-encode (#5102 ) * avoid changing original config * fix elif structure, batch with just int crashes otherwise * tok2vec example with doc2feats, encode and embed architectures * further clean up MultiHashEmbed * further generalize Tok2Vec to work with extract-embed-encode parts * avoid initializing the charembed layer with Docs (for now ?) * small fixes for bilstm config (still does not run) * rename to core layer * move new configs * walk model to set nI instead of using core ref * fix senter overfitting test to be more similar to the training data (avoid flakey behaviour)	2020-03-08 13:23:18 +01:00
Sofie Van Landeghem	1a2b8fc264	set vector of merged entity (#5085 ) * merge_entities sets the vector in the vocab for the merged token * add unit test * import unicode_literals * move code to _merge function * only set vector if vocab has non-zero vectors	2020-03-06 14:45:28 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109 ) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
Sofie Van Landeghem	6ac9fc0619	Unit test for NEL functionality (#5114 ) * empty begin_training for sentencizer * overfitting unit test for entity linker * fixed NEL IO by storing the entity_vector_length in the cfg	2020-03-06 14:42:23 +01:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
adrianeboyd	9be90dbca3	Improve token head verification (#5079 ) * Improve token head verification Improve the verification for valid token heads when heads are set: * in `Token.head`: heads come from the same document * in `Doc.from_array()`: head indices are within the bounds of the document * Improve error message	2020-03-03 21:44:51 +01:00
Sofie Van Landeghem	d307e9ca58	take care of global vectors in multiprocessing (#5081 ) * restore load_nlp.VECTORS in the child process * add unit test * fix test * remove unnecessary import * add utf8 encoding * import unicode_literals	2020-03-03 13:58:22 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080 )	2020-03-03 12:22:39 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morphologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
Ines Montani	c1a5ece65f	Tidy up setup and update requirements tests	2020-02-25 15:46:39 +01:00
Ines Montani	5d21d3e8b9	Merge branch 'develop' into pr/5008	2020-02-25 15:24:47 +01:00
Ines Montani	4440a072d2	Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore load Underscore state when multiprocessing	2020-02-25 14:46:02 +01:00
svlandeg	d821c95eb0	debugging prints	2020-02-23 17:38:33 +01:00
svlandeg	58568bd0cd	fix	2020-02-23 16:45:37 +01:00
svlandeg	0f55e51704	assert we found the root_dir	2020-02-23 16:33:58 +01:00
svlandeg	783da088ea	avoid try except	2020-02-23 16:21:21 +01:00

1580 Commits