spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-25 04:16:11 +03:00

Author	SHA1	Message	Date
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069 ) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	d722a439aa	Remove unneeded methods in senter and morphologizer (#6074 ) Now that the tagger doesn't manage the tag map, the child classes senter and morphologizer don't need to override the serialization methods.	2020-09-16 17:39:41 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062 ) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidential commit	2020-09-14 17:08:00 +02:00
Sofie Van Landeghem	744df9814a	define threshold for scoring textcat in TextCat config (#6055 ) * define threshold for scoring textcat in TextCat config * fix unit test and documentation	2020-09-13 14:15:52 +02:00
Sofie Van Landeghem	cb66ea7400	Remove simple_ner code (#6041 ) * remove simple_ner code * remove unused _biluo and _iob files	2020-09-09 16:11:27 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042 ) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Matthew Honnibal	dae22f3dfa	Fix ignoring of punct labels	2020-09-05 14:11:59 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Matthew Honnibal	737a1408d9	Improve implementation of fix #6010 Follow-ups to the parser efficiency fix. * Avoid introducing new counter for number of pushes * Base cut on number of transitions, keeping it more even * Reintroduce the randomization we had in v2.	2020-09-02 14:42:32 +02:00
Matthew Honnibal	c1bf3a5602	Fix significant performance bug in parser training (#6010 ) The parser training makes use of a trick for long documents, where we use the oracle to cut up the document into sections, so that we can have batch items in the middle of a document. For instance, if we have one document of 600 words, we might make 6 states, starting at words 0, 100, 200, 300, 400 and 500. The problem is for v3, I screwed this up and didn't stop parsing! So instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch of [600, 500, 400, 300, 200, 100]. Oops. The implementation here could probably be improved, it's annoying to have this extra variable in the state. But this'll do. This makes the v3 parser training 5-10 times faster, depending on document lengths. This problem wasn't in v2.	2020-09-02 12:57:13 +02:00
Matthew Honnibal	4cce32f090	Fix tagger initialization	2020-09-01 16:38:34 +02:00
Adriane Boyd	9130094199	Prevent Tagger model init with 0 labels (#5984 ) * Prevent Tagger model init with 0 labels Raise an error before trying to initialize a tagger model with 0 labels. * Add dummy tagger label for test * Remove tagless tagger model initializiation * Fix error number after merge * Add dummy tagger label to test * Fix formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-31 21:24:33 +02:00
Sofie Van Landeghem	ec14744ee4	Rename Transformer listener (#6001 ) * rename to spacy-transformers.TransformerListener * add some more tok2vec tests * use select_pipes * fix docs - annotation setter was not changed in the end	2020-08-31 12:41:39 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Ines Montani	34146750d4	Use frozen list with custom errors We don't want to break backwards compatibility too much but we also want to provide the best possible UX	2020-08-29 15:20:11 +02:00
Ines Montani	2bc31e15c9	Tidy up and auto-format [ci skip]	2020-08-29 13:01:10 +02:00
Ines Montani	f45095a666	Merge pull request #5995 from adrianeboyd/bugfix/attribute-ruler-bugfixes	2020-08-29 12:38:30 +02:00
Matthew Honnibal	58f19421b1	Return empty batch from tok2vec listener if no doc.tensor	2020-08-29 03:46:50 +02:00
Adriane Boyd	0104bd1600	Sort the AttributeRuler matches by rule order Sort the returned matches by rule order (the `match_id`) so that the rules are applied in the order they were added. This is necessary, for instance, if the `AttributeRuler` is used for the tag map and later rules require POS tags.	2020-08-28 21:01:06 +02:00
Adriane Boyd	8674b17651	Serialize AttributeRuler.patterns Serialize `AttributeRuler.patterns` instead of the individual lists to simplify the serialized and so that patterns are reloaded exactly as they were originally provided (preserving `_attrs_unnormed`).	2020-08-28 20:44:45 +02:00
Matthew Honnibal	d3ffe4ca63	Fix error when tagger was initialized with no labels	2020-08-27 18:56:58 +02:00
Matthew Honnibal	95adb58f15	Force tagger to pass batch of docs into model in begin_training	2020-08-27 03:21:03 +02:00
Adriane Boyd	90d88729e0	Add AttributeRuler.score (#5963 ) * Add AttributeRuler.score Add scoring for TAG / POS / MORPH / LEMMA if these are present in the assigned token attributes. Add default score weights (that don't really make a lot of sense) so that the scores are in the default config in some form. * Update docs	2020-08-26 15:39:30 +02:00
Sofie Van Landeghem	79d460e3a2	Weights & Biases logger for train CLI (#5971 ) * quick test as part of train script * train_logger in config, default ConsoleLogger in loggers catalogue * entitiy typo * add wandb_logger * cleanup * Update spacy/cli/train_logger.py Co-authored-by: Ines Montani <ines@ines.io> * move loggers to gold.loggers Co-authored-by: Ines Montani <ines@ines.io>	2020-08-26 15:24:33 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876 ) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Ines Montani	8128e5eb35	Replace lexeme_norm warning with logging	2020-08-14 15:00:52 +02:00
Ines Montani	e4d0990857	Only receive from listener if listener exists	2020-08-14 14:58:48 +02:00
Adam Bittlingmayer	7b33b2854f	Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910 ) * Add Armenian sentence-final verchaket * Add Greek and Arabic question marks, and contributor agreement * Check box	2020-08-12 15:36:14 +02:00
graue70	49e690bde1	Fix typos in comments (#5904 ) * Fix typo in comment * Fix typo * Add spaCy Contributor Agreement	2020-08-12 15:35:25 +02:00
graue70	ba84371ab0	Use init parameter (#5909 )	2020-08-11 23:41:58 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	f79e4c094d	Remove generic type Seems to cause error on Python 3.8 with Cython?	2020-08-10 17:24:30 +02:00
Ines Montani	64f2f84098	Update docstrings and docs [ci skip]	2020-08-10 13:45:22 +02:00
Ines Montani	a4b448eec4	Remove unused compiler flag	2020-08-10 13:13:18 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Ines Montani	d5c78c7a34	Update docs and fix consistency	2020-08-09 22:31:52 +02:00
Ines Montani	7c6854d8d4	Fix missing imports	2020-08-09 22:28:29 +02:00
Ines Montani	a15c5fb191	Update docstrings and docs	2020-08-09 16:10:48 +02:00
Matthew Honnibal	134d933d67	Add docstring for entity linker factory	2020-08-09 15:19:28 +02:00
Matthew Honnibal	992ee1c02f	Update tagger docstring	2020-08-09 15:09:31 +02:00
Matthew Honnibal	ebf9a7acbf	Add textcat docstring	2020-08-09 15:07:09 +02:00
Matthew Honnibal	bbd8acd4bf	Add docstrings for parser and NER. Simplify some arguments	2020-08-09 14:46:13 +02:00
Matthew Honnibal	39a3d64c01	Add docstrings for Tok2Vec component	2020-08-09 00:48:03 +02:00
Ines Montani	fe29ceec9e	Merge branch 'develop' into docs/model-docstrings	2020-08-07 18:42:01 +02:00
Ines Montani	3a193eb8f1	Fix imports, types and default configs	2020-08-07 18:40:54 +02:00
Ines Montani	6f3649923c	Merge pull request #5893 from explosion/feature/validate-arg	2020-08-07 15:47:20 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00

1 2 3 4 5 ...

309 Commits