spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-15 06:09:01 +03:00

Author	SHA1	Message	Date
svlandeg	64d90039a1	encoding UTF8	2020-09-29 10:54:42 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	046f655d86	Fix error	2020-09-28 21:17:45 +02:00
Ines Montani	a139fe672b	Fix typos and refactor CLI logging	2020-09-28 21:17:10 +02:00
Ines Montani	2e9c9e74af	Fix config resolution and interpolation TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)	2020-09-28 15:34:00 +02:00
Ines Montani	02838a1d47	Fix resolve_dot_names	2020-09-28 15:27:10 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Ines Montani	a89e0ff7cb	Fix typo	2020-09-28 12:55:21 +02:00
Ines Montani	a62337b3f3	Tidy up vocab init	2020-09-28 12:53:06 +02:00
Ines Montani	c22ecc66bb	Don't support init path for now	2020-09-28 12:46:28 +02:00
Ines Montani	f49288ab81	Update default_config_pretraining.cfg	2020-09-28 12:31:54 +02:00
Ines Montani	a5f2cc0509	Tidy up and remove raw text (rehearsal) for now	2020-09-28 12:30:13 +02:00
Ines Montani	1590de11b1	Update config	2020-09-28 12:05:23 +02:00
Matthew Honnibal	9f6ad06452	Upd default config	2020-09-28 12:00:23 +02:00
Ines Montani	e44a7519cd	Update CLI and add [initialize] block	2020-09-28 11:56:14 +02:00
Ines Montani	d5155376fd	Update vocab init	2020-09-28 11:30:18 +02:00
Ines Montani	8b74fd19df	init pipeline -> init nlp	2020-09-28 11:13:38 +02:00
Ines Montani	2fdb7285a0	Update CLI	2020-09-28 11:06:07 +02:00
Ines Montani	553bfea641	Fix commands	2020-09-28 10:53:17 +02:00
Matthew Honnibal	44bad1474c	Add init_pipeline file	2020-09-28 09:47:34 +02:00
Matthew Honnibal	65448b2e34	Remove schema=None until Optional	2020-09-28 03:42:58 +02:00
Matthew Honnibal	b886f53c31	init-pipeline runs (maybe doesn't work)	2020-09-28 03:42:47 +02:00
Matthew Honnibal	ed2aff2db3	Remove unused train code	2020-09-28 03:12:31 +02:00
Matthew Honnibal	3a0a3b8db6	Don't hard-code for 'corpora' name	2020-09-28 03:06:33 +02:00
Matthew Honnibal	a023cf3ecc	Add (untested) resolve_dot_names util	2020-09-28 03:06:12 +02:00
Matthew Honnibal	a976da168c	Support data augmentation in Corpus (#6155) * Support data augmentation in Corpus * Note initial docs for data augmentation * Add augmenter to quickstart * Fix flake8 * Format * Fix test * Update spacy/tests/training/test_training.py * Improve data augmentation arguments * Update templates * Move randomization out into caller * Refactor * Update spacy/training/augment.py * Update spacy/tests/training/test_training.py * Fix augment * Fix test	2020-09-28 03:03:27 +02:00
Matthew Honnibal	13b1605ee6	Add init script	2020-09-28 01:08:49 +02:00
Matthew Honnibal	a3e1791c9c	Upd train	2020-09-28 01:08:30 +02:00
Matthew Honnibal	b5556093e2	Start updating train script	2020-09-27 23:59:44 +02:00
Ines Montani	9016d23cc5	Fix exclude and add test	2020-09-27 23:34:03 +02:00
Ines Montani	658fad428a	Fix base schema integration	2020-09-27 22:50:36 +02:00
Ines Montani	e04bd16f7f	Merge branch 'develop' into feature/new-thinc-config-resolution	2020-09-27 22:34:46 +02:00
Ines Montani	d7ad65a9bb	Fix handling of error description [ci skip]	2020-09-27 22:31:57 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Adriane Boyd	013b66de05	Add tokenizer scoring to ja / ko / zh (#6152)	2020-09-27 22:20:45 +02:00
Adriane Boyd	a6548ead17	Add _ as a symbol (#6153) * Add _ to StringStore in Morphology * Add _ as a symbol Add `_` as a symbol instead of adding to the `StringStore`.	2020-09-27 22:20:14 +02:00
Matthew Honnibal	39b178999c	Tmp notes	2020-09-27 20:13:38 +02:00
Adriane Boyd	8393dbedad	Minor fixes * Put `cfg` back in serialization * Add `pickle5` to pytest conf	2020-09-27 15:15:53 +02:00
Adriane Boyd	54fe871935	Fix formatting, refactor pickle5 exceptions	2020-09-27 14:37:28 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	b4486d747d	Merge branch 'develop' into fix/train-config-interpolation	2020-09-26 15:32:14 +02:00
Ines Montani	8fea06d55e	Merge pull request #6149 from adrianeboyd/feature/attributeruler-match-ids Simplify string match IDs for AttributeRuler	2020-09-26 15:31:30 +02:00
Ines Montani	b2d07de786	Construct nlp from uninterpolated config before training	2020-09-26 15:16:59 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Adriane Boyd	6c25e60089	Simplify string match IDs for AttributeRuler	2020-09-26 11:12:39 +02:00
Matthew Honnibal	702edf52a0	Fix attributeruler	2020-09-26 00:30:48 +02:00
Matthew Honnibal	821f37254c	Fix attributeruler	2020-09-26 00:19:53 +02:00
Matthew Honnibal	98327f66a9	Fix attributeruler key	2020-09-25 23:20:50 +02:00
Matthew Honnibal	092ce4648e	Make DocBin output stable data (set iteration)	2020-09-25 22:20:44 +02:00
Matthew Honnibal	26afd3bd90	Fix iteration order	2020-09-25 21:47:22 +02:00
Matthew Honnibal	3d8388969e	Sort paths for cache consistency	2020-09-25 19:07:26 +02:00
Adriane Boyd	c3b5a3cfff	Clean up MorphAnalysisC struct (#6146)	2020-09-25 15:56:48 +02:00
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Adriane Boyd	50f20cf722	Revert changes to Scorer.score_spans	2020-09-25 08:21:47 +02:00
Matthew Honnibal	93d7ff309f	Remove print	2020-09-24 21:05:27 +02:00
Matthew Honnibal	16475528f7	Fix skipped documents in entity scorer (#6137) * Fix skipped documents in entity scorer * Add back the skipping of unannotated entities * Update spacy/scorer.py * Use more specific NER scorer * Fix import * Fix get_ner_prf * Add scorer * Fix scorer Co-authored-by: Ines Montani <ines@ines.io>	2020-09-24 20:38:57 +02:00
Matthew Honnibal	2abb4ba9db	Make a pre-check to speed up alignment cache (#6139) * Dirty trick to fast-track alignment cache * Improve alignment cache check * Fix header * Fix align cache * Fix align logic	2020-09-24 18:13:39 +02:00
Ines Montani	26e28ed413	Fix combined scores if multiple components report it	2020-09-24 17:11:13 +02:00
Ines Montani	0b52b6904c	Update entity_linker.py	2020-09-24 17:10:35 +02:00
Ines Montani	20b89a9717	Increment version [ci skip]	2020-09-24 16:57:02 +02:00
Adriane Boyd	3c062b3911	Add MORPH handling to Matcher (#6107) * Add MORPH handling to Matcher * Add `MORPH` to `Matcher` schema * Rename `_SetMemberPredicate` to `_SetPredicate` * Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate` * Add special handling for normalization and conversion of morph values into sets * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only matches for 0 or 1 values * Update test * Rename to IS_SUBSET and IS_SUPERSET	2020-09-24 16:55:09 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	d0ef4a4cf5	Prevent division by zero in score weights	2020-09-24 16:42:13 +02:00
Matthew Honnibal	74ee456374	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-24 16:11:47 +02:00
Matthew Honnibal	0bc214c102	Fix pull	2020-09-24 16:11:33 +02:00
Ines Montani	3f751e68f5	Increment version [ci skip]	2020-09-24 14:45:41 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Ines Montani	74e1f192b4	Merge pull request #6134 from explosion/feature/training_before_to_disk	2020-09-24 14:44:11 +02:00
Ines Montani	24e7ac3f2b	Fix download CLI [ci skip]	2020-09-24 14:43:56 +02:00
Ines Montani	88e54caa12	accuracy -> performance	2020-09-24 14:32:35 +02:00
Ines Montani	92f8b6959a	Fix typo	2020-09-24 13:48:41 +02:00
Adriane Boyd	5c13e0cf1b	Remove unused error	2020-09-24 13:41:55 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Adriane Boyd	8eaacaae97	Refactor Doc.ents setter to use Doc.set_ents Additional changes: * Entity spans with missing labels are ignored * Fix ent_kb_id setting in `Doc.set_ents`	2020-09-24 12:36:51 +02:00
Ines Montani	c6c67b606e	Merge pull request #6133 from explosion/fix/score_weights	2020-09-24 12:00:57 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Ines Montani	4eb39b5c43	Fix logging	2020-09-24 11:04:35 +02:00
Ines Montani	4bbe41f017	Fix combined scores and update test	2020-09-24 10:42:47 +02:00
Sofie Van Landeghem	c645c4e7ce	fix micro PRF for textcat (#6130) * fix micro PRF for textcat * small fix	2020-09-24 10:31:17 +02:00
Matthew Honnibal	17a6b0a173	Make project pull order insensitive (#6131)	2020-09-24 10:30:42 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
Ines Montani	f25f05c503	Adjust sort order [ci skip]	2020-09-23 20:03:04 +02:00
Ines Montani	3f77eb749c	Increment version [ci skip]	2020-09-23 19:50:15 +02:00
svlandeg	b816ace4bb	format	2020-09-23 17:33:13 +02:00
svlandeg	5a9fdbc8ad	state_type as Literal	2020-09-23 17:32:14 +02:00
svlandeg	35dbc63578	Merge remote-tracking branch 'upstream/develop' into fix/nr_features # Conflicts: # spacy/ml/models/parser.py # spacy/tests/serialize/test_serialize_config.py # website/docs/api/architectures.md	2020-09-23 17:01:13 +02:00
svlandeg	25b34bba94	throw custom error when state_type is invalid	2020-09-23 16:57:14 +02:00
Ines Montani	916050bf2f	Merge pull request #6127 from explosion/feature/literal-nr_feature_tokens	2020-09-23 16:56:08 +02:00
Ines Montani	3c3863654e	Increment version [ci skip]	2020-09-23 16:54:43 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
Ines Montani	50a4425cda	Adjust docs	2020-09-23 16:03:32 +02:00
Ines Montani	76bbed3466	Use Literal type for nr_feature_tokens	2020-09-23 16:00:03 +02:00
Muhammad Fahmi Rasyid	7489d02dea	Update Indonesian Example Phrases (#6124) * create contributor agreement * Update Indonesian example. (see #1107) Update Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.	2020-09-23 14:02:26 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	7745d77a38	Fix whitespace in template [ci skip]	2020-09-23 13:21:42 +02:00
svlandeg	6435458d51	simplify expression	2020-09-23 12:12:38 +02:00
svlandeg	20b0ec5dcf	avoid logging performance of frozen components	2020-09-23 10:37:12 +02:00
Ines Montani	ae5dacf75f	Tidy up and add types	2020-09-23 10:14:34 +02:00
Ines Montani	6ca06cb62c	Update docs and formatting [ci skip]	2020-09-23 10:14:27 +02:00
Ines Montani	888f936a73	Merge pull request #6106 from svlandeg/feature/textcat-quickstart	2020-09-23 10:11:45 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
Ines Montani	f976bab710	Remove empty file [ci skip]	2020-09-23 09:30:09 +02:00
svlandeg	556f3e4652	add pooling to NEL's TransformerListener	2020-09-23 09:24:28 +02:00
svlandeg	4a56ea72b5	fallbacks for old names	2020-09-23 09:15:07 +02:00
Sofie Van Landeghem	86a08f819d	tok2vec.update instead of predict (#6113)	2020-09-22 21:54:52 +02:00
Adriane Boyd	e4acb28658	Fix norm in retokenizer split (#6111) Parallel to behavior in merge, reset norm on original token in retokenizer split.	2020-09-22 21:53:33 +02:00
Sofie Van Landeghem	e0e793be4d	fix KB IO (#6118)	2020-09-22 21:53:06 +02:00
Adriane Boyd	9b4979407d	Fix overlapping German noun chunks (#6112) Add a similar fix as in #5470 to prevent the German noun chunks iterator from producing overlapping spans.	2020-09-22 21:52:42 +02:00
Adriane Boyd	b1a7d6c528	Refactor seen token detection	2020-09-22 14:42:51 +02:00
Sofie Van Landeghem	d53c84b6d6	avoid None callback (#6100)	2020-09-22 13:54:44 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	085a1c8e2b	add no_output_layer to TextCatBOW config	2020-09-22 12:06:40 +02:00
svlandeg	e1b8090b9b	few more fixes	2020-09-22 12:01:06 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
svlandeg	e931f4d757	add textcat score	2020-09-22 10:56:43 +02:00
svlandeg	396b33257f	add entity_linker to jinja template	2020-09-22 10:40:05 +02:00
Ines Montani	db7126ead9	Increment version	2020-09-22 10:31:26 +02:00
svlandeg	135de82a2d	add textcat to quickstart	2020-09-22 10:22:06 +02:00
Ines Montani	6316d5f398	Improve messages in project CLI [ci skip]	2020-09-22 09:45:34 +02:00
Ines Montani	49e80dbcac	Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc	2020-09-22 09:45:04 +02:00
Ines Montani	81606b29bd	Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]	2020-09-22 09:31:23 +02:00
Ines Montani	beb766d0a0	Add test	2020-09-22 09:15:57 +02:00
Ines Montani	285fa934d8	Merge branch 'chore/tidy-up-tests-docs-get-doc' of https://github.com/explosion/spaCy into chore/tidy-up-tests-docs-get-doc	2020-09-22 09:10:14 +02:00
Ines Montani	69f7e52c26	Update README.md	2020-09-22 09:10:06 +02:00
svlandeg	45b29c4a5b	cleanup	2020-09-21 23:17:23 +02:00
svlandeg	fa5c416db6	initialize through nlp object and with train_corpus	2020-09-21 23:09:22 +02:00
Matthew Honnibal	3abc4a5adb	Slightly tidy doc.ents.__set__	2020-09-21 22:58:03 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Ines Montani	a5f6ab4943	Merge pull request #6098 from adrianeboyd/feature/doc-init	2020-09-21 18:35:20 +02:00
Adriane Boyd	f212303729	Add sent_starts to Doc.__init__ Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start` values but also convert to and accept `sent_start` internally.	2020-09-21 17:59:09 +02:00
svlandeg	447b3e5787	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 16:58:40 +02:00
Ines Montani	b3327c1e45	Increment version [ci skip]	2020-09-21 16:04:30 +02:00
Ines Montani	e8bcaa44f1	Don't auto-decompress archives with smart_open [ci skip]	2020-09-21 16:01:46 +02:00
Adriane Boyd	6aa91c7ca0	Make user_data keyword-only	2020-09-21 16:00:06 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Ines Montani	b2302c0a1c	Improve error for missing dependency	2020-09-20 17:44:51 +02:00
Matthew Honnibal	8fb59d958c	Format	2020-09-20 16:31:48 +02:00
Matthew Honnibal	dc22771f87	Fix sparse checkout	2020-09-20 16:30:05 +02:00
Matthew Honnibal	a0fb5e50db	Use simple git clone call if not sparse	2020-09-20 16:22:04 +02:00
Matthew Honnibal	2c24d633d0	Use updated run_command	2020-09-20 16:21:43 +02:00
Matthew Honnibal	889128e5c5	Improve error handling in run_command	2020-09-20 16:20:57 +02:00
Ines Montani	554c9a2497	Update docs [ci skip]	2020-09-20 12:30:53 +02:00
svlandeg	6db1d5dc0d	trying some stuff	2020-09-19 19:11:30 +02:00
Ines Montani	e863b3dc14	Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2	2020-09-19 12:33:38 +02:00
Sofie Van Landeghem	39872de1f6	Introducing the gpu_allocator (#6091) * rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator' * --code instead of --code-path * update documentation * avoid querying the "system" section directly * add explanation of gpu_allocator to TF/PyTorch section in docs * fix typo * fix typo 2 * use set_gpu_allocator from thinc 8.0.0a34 * default null instead of empty string	2020-09-19 01:17:02 +02:00
Adriane Boyd	47080fba98	Minor renaming / refactoring * Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message * Make `Vocab.lookups` a property	2020-09-18 19:43:19 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Matthew Honnibal	bbdb5f62b7	Temporary work-around for scoring a subset of components (#6090) * Try hacking the scorer to work around sentence boundaries * Upd scorer * Set dev version * Upd scorer hack * Fix version * Improve comment on hack	2020-09-18 14:26:42 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
svlandeg	e4fc7e0222	fixing output sample to proper 2D array	2020-09-17 22:34:36 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
Ines Montani	3865214343	Use consistent shortcut	2020-09-17 16:57:02 +02:00
svlandeg	35a3931064	fix typo	2020-09-17 16:36:27 +02:00
svlandeg	ddfc1fc146	add pretraining option to init config	2020-09-17 16:05:40 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	781fae678b	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
Matthew Honnibal	8303d101a5	Set version to v3.0.0a19	2020-09-17 00:18:49 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_from_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjusts tokens within a specified range rather than the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
Matthew Honnibal	c776594ab1	Fix	2020-09-16 18:15:14 +02:00
Matthew Honnibal	4a573d18b3	Add comment	2020-09-16 17:51:29 +02:00
Matthew Honnibal	d31afc8334	Fix Language.link_components when model is None	2020-09-16 17:49:48 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	d722a439aa	Remove unneeded methods in senter and morphologizer (#6074) Now that the tagger doesn't manage the tag map, the child classes senter and morphologizer don't need to override the serialization methods.	2020-09-16 17:39:41 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
svlandeg	1040e250d8	actual commit with test for custom readers with ml_datasets >= 0.2	2020-09-16 16:41:28 +02:00
svlandeg	714a5a05c6	test for custom readers with ml_datasets >= 0.2	2020-09-16 16:39:55 +02:00
svlandeg	0d1392340f	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-15 23:17:08 +02:00
svlandeg	f420aa1138	use e.value to get to the ExceptionInfo value	2020-09-15 22:30:09 +02:00
svlandeg	7336657662	corpus is a Dict	2020-09-15 22:07:16 +02:00
svlandeg	51fa929f47	rewrite train_corpus to corpus.train in config	2020-09-15 21:58:04 +02:00
svlandeg	bd87e8686e	move tests to correct subdir	2020-09-15 21:40:38 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	91a6637f74	Remove extra pipe config values before merging	2020-09-15 14:24:17 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	2ed6e2a218	Auto-format	2020-09-15 14:20:04 +02:00
Ines Montani	2214d1bb7b	Merge pull request #6067 from explosion/feature/spacy-blank-from-config	2020-09-15 14:18:33 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
svlandeg	7677e5c0e2	fix wandb logger when calling multiple times from same script	2020-09-15 12:56:33 +02:00
Ines Montani	eff9406718	Support vocab arg in spacy.blank	2020-09-15 11:39:36 +02:00
Ines Montani	99549a5ace	Fix consistency and update docs	2020-09-15 11:37:37 +02:00
Ines Montani	7dfc4bc062	Allow overriding meta from spacy.blank	2020-09-15 11:12:12 +02:00
Ines Montani	0f943157af	Delegate to Language.from_config in spacy.blank	2020-09-15 11:07:55 +02:00
Ines Montani	e977086a9a	Update default pretraining config [ci skip]	2020-09-15 01:12:02 +02:00
Ines Montani	154752f9c2	Update docs and consistency [ci skip]	2020-09-15 00:32:49 +02:00
Ines Montani	9cc304c194	Merge pull request #6064 from explosion/fix/sparse-checkout-ux Fix sparse checkout and error handling	2020-09-15 00:32:20 +02:00
Matthew Honnibal	475323cd36	Set version to v3.0.0a18	2020-09-14 22:05:43 +02:00
Matthew Honnibal	e8378b57bc	Fix test	2020-09-14 21:21:13 +02:00
Matthew Honnibal	adf0bab23a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-14 21:04:49 +02:00
Matthew Honnibal	ae15fa9688	Fix iob converter	2020-09-14 21:02:18 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidental commit	2020-09-14 17:08:00 +02:00
Ines Montani	c052017025	Fix sparse checkout and error handling	2020-09-14 14:12:58 +02:00
Matthew Honnibal	fdd2340f6c	Set version to v3.0.0a17	2020-09-13 23:52:03 +02:00
Ines Montani	416deb412f	Prevent duplicate traceback on CalledProcessError [ci skip]	2020-09-13 19:28:54 +02:00
Ines Montani	61a4ef0b46	Fix syntax error	2020-09-13 19:23:09 +02:00
Matthew Honnibal	b693d2d224	Fix speed report in table	2020-09-13 17:39:31 +02:00
Sofie Van Landeghem	744df9814a	define threshold for scoring textcat in TextCat config (#6055) * define threshold for scoring textcat in TextCat config * fix unit test and documentation	2020-09-13 14:15:52 +02:00
Adriane Boyd	ab270364f1	Modify Token.morph to enable unsetting (#6043) Modify `Token.morph` property so that `Token.c.morph` can be reset back to an internal value of `0`. Allow setting `Token.morph` from a hash as long as the morph string is already in the `StringStore`, setting it indirectly through `Token.morph_` so that the value is added to the morphology. If the hash is not in the `StringStore`, raise an error.	2020-09-13 14:06:07 +02:00
Adriane Boyd	c7bd631b5f	Fix token.idx for special cases with affixes (#6035)	2020-09-13 14:05:36 +02:00
Matthew Honnibal	54c40223a1	Improve v3 pretrain command (#6040) * Starts to run * Update pretrain script * Update corpus * Update pretrain schema * Remove outdated test * Make JsonlTexts produce Example objects.	2020-09-13 14:05:05 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
Ines Montani	a5633b205f	Fix handling of errors around git [ci skip]	2020-09-13 10:52:28 +02:00
Ines Montani	f8846c198d	Update types and docstrings	2020-09-13 10:52:02 +02:00
Sofie Van Landeghem	e92e850c72	Raise if empty examples (#6052) * raise error if no valid Example objects were found during initialization * fix max_length parameter * remove commit from other branch Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-12 21:01:53 +02:00
Matthew Honnibal	37347830d4	Fix reading in GloVe vectors	2020-09-12 17:31:18 +02:00
Ines Montani	b41be87213	Merge pull request #6051 from svlandeg/feature/cli-config	2020-09-12 17:12:35 +02:00
Ines Montani	eedaaaec75	Fix handling of existing asset without checksum [ci skip]	2020-09-12 17:02:53 +02:00
svlandeg	a75cfe0da6	Merge remote-tracking branch 'upstream/develop' into feature/cli-config	2020-09-12 14:44:40 +02:00
svlandeg	115147804a	string_to_list to parse comma-separated string into a list	2020-09-12 14:43:22 +02:00
Ines Montani	f886f5bbc8	Merge pull request #6048 from explosion/fix/clone-compat	2020-09-12 10:30:49 +02:00
svlandeg	711166a75a	prevent overwriting score_weights	2020-09-11 15:12:05 +02:00
Ines Montani	62eec33bc4	Fix meta.json validation	2020-09-11 11:38:33 +02:00
Ines Montani	0b2e07215d	Support overwriting name on spacy package	2020-09-11 11:38:28 +02:00
svlandeg	5b94aeece9	support pipeline as "list in string"	2020-09-11 11:08:46 +02:00
Ines Montani	1bce432b4a	Adjust message [ci skip]	2020-09-11 10:00:49 +02:00
Ines Montani	5acd4fbcd8	Merge branch 'develop' into fix/clone-compat	2020-09-11 09:58:30 +02:00
Ines Montani	761bd60d43	Adjust info message	2020-09-11 09:57:00 +02:00
Ines Montani	6831161bfa	Resolve path to be extra sure	2020-09-11 09:56:49 +02:00
svlandeg	1723fb73c4	remove brol	2020-09-10 17:44:59 +02:00
svlandeg	08a831ce83	process trailing slash if any	2020-09-10 17:39:52 +02:00
Ines Montani	3e83a509bb	WIP: fix project clone compatibility	2020-09-10 15:49:13 +02:00
svlandeg	f1bc09c1e9	restore partly	2020-09-10 14:53:02 +02:00
svlandeg	3889747119	asset fix & UX	2020-09-10 14:36:53 +02:00
svlandeg	a36766d153	hookup branch	2020-09-10 12:00:34 +02:00
svlandeg	97d99f7efa	Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes	2020-09-10 11:51:34 +02:00
Ines Montani	908f3a4494	Update default projects repo [ci skip]	2020-09-10 11:42:14 +02:00
svlandeg	92f9d2f406	small UX fixes	2020-09-10 11:35:50 +02:00
svlandeg	1fc5486792	more fine-grained errors for git_sparse_checkout	2020-09-10 11:31:32 +02:00
Ines Montani	15bc3a37b4	Add --branch to project clone	2020-09-10 11:08:15 +02:00
Ines Montani	1955aaaa20	Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip]	2020-09-09 21:46:40 +02:00
Sofie Van Landeghem	cb66ea7400	Remove simple_ner code (#6041) * remove simple_ner code * remove unused _biluo and _iob files	2020-09-09 16:11:27 +02:00
svlandeg	39aa740777	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-09 11:59:34 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
svlandeg	d0a8849e4d	fix typo	2020-09-08 18:32:12 +02:00
svlandeg	bd8f9b188b	small fixes	2020-09-08 17:24:36 +02:00
Matthew Honnibal	4b82882767	Fix defaults	2020-09-08 15:31:21 +02:00
Matthew Honnibal	5d09e3e154	Set version to v3.0.0a15	2020-09-08 15:25:10 +02:00
Matthew Honnibal	ba5f4c9b32	Add words and seconds to train info	2020-09-08 15:24:47 +02:00
Matthew Honnibal	b470062153	Add CLI registry (#6037)	2020-09-08 15:23:34 +02:00
svlandeg	06ef66fd73	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-08 10:28:42 +02:00
Matthew Honnibal	dae22f3dfa	Fix ignoring of punct labels	2020-09-05 14:11:59 +02:00
Matthew Honnibal	12e1279f6b	Set version to v3.0.0a14	2020-09-05 04:13:53 +02:00
Matthew Honnibal	4b7abaafdb	Fix learn rate for non-transformer	2020-09-04 21:22:50 +02:00
Matthew Honnibal	465785a672	Fix project pull and push	2020-09-04 21:15:55 +02:00
Ines Montani	f174c7b1f3	Merge branch 'develop' into pr/6018	2020-09-04 15:54:49 +02:00
Ines Montani	f06eed800e	Merge pull request #6029 from explosion/master-tmp	2020-09-04 15:11:55 +02:00
Ines Montani	f9550b4493	Fix components in meta.json and website [ci skip]	2020-09-04 14:42:12 +02:00
Ines Montani	d7cc2ee72d	Fix tests	2020-09-04 14:05:55 +02:00
Ines Montani	90043a6f9b	Tidy up and auto-format	2020-09-04 13:42:33 +02:00
Ines Montani	df0b68f60e	Remove unicode declarations and update language data	2020-09-04 13:19:16 +02:00
Ines Montani	ba600f91c5	Tidy up imports	2020-09-04 13:15:44 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Adriane Boyd	b927893309	Merge branch 'develop' into feature/dependency-matcher-v3	2020-09-04 13:03:30 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
holubvl3	0a27fca557	Create examples.py (#5985) * Create examples.py * Create tag_map.py * Delete tag_map.py * Update examples.py formatting: add empty line Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-09-04 11:00:14 +02:00
Ines Montani	2189046869	Merge pull request #6024 from explosion/chore/registry-renaming	2020-09-04 10:54:10 +02:00
svlandeg	c32fcdf4c9	fix typo	2020-09-04 09:10:21 +02:00
Ines Montani	595f9dc2e4	Make displacy color registry consistent with others This was the only registry that expected the registered objects to be dictionaries instead of functions that return something. We can still support plain dicts but we should also support functions for consistency	2020-09-03 23:05:41 +02:00
Matthew Honnibal	1c07820681	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-03 18:54:21 +02:00
Matthew Honnibal	7be8a0516a	Fix project pull	2020-09-03 18:54:03 +02:00
Ines Montani	23b7d9cfa3	Prefix span getters	2020-09-03 17:37:06 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Ines Montani	c063e55eb7	Add prefix to batchers	2020-09-03 17:30:41 +02:00
Ines Montani	896caf45e3	Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip]	2020-09-03 17:13:44 +02:00
Ines Montani	c53b1433b9	Adjust more arguments [ci skip]	2020-09-03 17:12:24 +02:00
Ines Montani	b5a0657fd6	"model" terminology consistency in docs	2020-09-03 13:13:03 +02:00
Matthew Honnibal	f038841798	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-03 12:52:39 +02:00
Matthew Honnibal	ef0d0630a4	Let Language.use_params work with falsey inputs The Language.use_params method was failing if you passed in None, which meant we had to use awkward conditionals for the parameter averaging. This solves the problem.	2020-09-03 12:51:04 +02:00
Yohei Tamura	5af432e0f2	fix for empty string (#5936)	2020-09-03 10:09:03 +02:00
Adriane Boyd	77ac4a38aa	Simplify specials and cache checks (#6012)	2020-09-03 09:42:49 +02:00
Adriane Boyd	8b5594df86	Remove near-duplicate test	2020-09-02 20:32:01 +02:00
Matthew Honnibal	122cb02001	Fix averages	2020-09-02 19:37:43 +02:00
Adriane Boyd	960d9cfadc	Officially support DependencyMatcher Add official support for the `DependencyMatcher`. Redesign the pattern specification. Fix and extend operator implementations. Update API docs and add usage docs. Patterns -------- Refactor pattern structure to: ``` { "LEFT_ID": str, "REL_OP": str, "RIGHT_ID": str, "RIGHT_ATTRS": dict, } ``` The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all subsequent nodes contain all four keys. New operators ------------- Because of the way patterns are constructed from left to right, it's helpful to have `follows` operators along with `precedes` operators. Add operators for simple precedes / follows alongside immediate precedes / follows. * `.`: immediately precedes * `.*`: precedes * `;`: immediately follows * `;*`: follows Operator fixes -------------- * `<` and `<<` do not include the node itself * Fix reversed order for all operators involving linear precedence (`.`, all sibling operators) * Linear precedence operators do not match nodes outside the same parse Additional fixes ---------------- * Use v3 Matcher API * Support `get` and `remove` * Support pickling	2020-09-02 17:45:29 +02:00
Marek Grzenkowicz	92d7832a86	Fix off-by-one error for best iteration calculation (closes #6014) (#6016)	2020-09-02 15:15:45 +02:00
Matthew Honnibal	737a1408d9	Improve implementation of fix #6010 Follow-ups to the parser efficiency fix. * Avoid introducing new counter for number of pushes * Base cut on number of transitions, keeping it more even * Reintroduce the randomization we had in v2.	2020-09-02 14:42:32 +02:00
Sofie Van Landeghem	eb56377799	Fix overfitting test (#6011) * remove unused MORPH_RULES * fix textcat architecture in overfitting test	2020-09-02 13:07:41 +02:00
Adriane Boyd	b97d98783a	Fix Hungarian % tokenization (#6013)	2020-09-02 13:06:16 +02:00
Matthew Honnibal	c1bf3a5602	Fix significant performance bug in parser training (#6010) The parser training makes use of a trick for long documents, where we use the oracle to cut up the document into sections, so that we can have batch items in the middle of a document. For instance, if we have one document of 600 words, we might make 6 states, starting at words 0, 100, 200, 300, 400 and 500. The problem is for v3, I screwed this up and didn't stop parsing! So instead of a batch of [100, 100, 100, 100, 100, 100], we'd have a batch of [600, 500, 400, 300, 200, 100]. Oops. The implementation here could probably be improved, it's annoying to have this extra variable in the state. But this'll do. This makes the v3 parser training 5-10 times faster, depending on document lengths. This problem wasn't in v2.	2020-09-02 12:57:13 +02:00
Sofie Van Landeghem	f7a25d69f7	Bugfix in merge_entities (#6005) * failing test * bugfix	2020-09-01 21:57:52 +02:00
Sofie Van Landeghem	6bfb1b3a29	Fix sparse checkout for 'spacy project' (#6008) * exit if cloning fails * UX * rewrite http link to git protocol, don't use stdin * fixes to sparse checkout * formatting	2020-09-01 19:49:01 +02:00
Matthew Honnibal	4cce32f090	Fix tagger initialization	2020-09-01 16:38:34 +02:00
Matthew Honnibal	046c38bd26	Remove 'cleanup' of strings (#6007) A long time ago we went to some trouble to try to clean up "unused" strings, to avoid the `StringStore` growing in long-running processes. This never really worked reliably, and I think it was a really wrong approach. It's much better to let the user reload the `nlp` object as necessary, now that the string encoding is stable (in v1, the string IDs were sequential integers, making reloading the NLP object really annoying.) The extra book-keeping does make some performance difference, and the feature is unused, so it's past time we killed it.	2020-09-01 16:12:15 +02:00
Ines Montani	70b226f69d	Support ignore marker in project document [ci skip]	2020-09-01 12:49:04 +02:00
Ines Montani	a4c51f0f18	Add v3 info to project docs [ci skip]	2020-09-01 12:36:21 +02:00
Ines Montani	ef9005273b	Update fill-config command and add silent mode [ci skip]	2020-09-01 12:07:04 +02:00
Matthew Honnibal	ec660e3131	Fix use_pytorch_for_gpu_memory	2020-09-01 00:41:38 +02:00
Adriane Boyd	9130094199	Prevent Tagger model init with 0 labels (#5984) * Prevent Tagger model init with 0 labels Raise an error before trying to initialize a tagger model with 0 labels. * Add dummy tagger label for test * Remove tagless tagger model initialization * Fix error number after merge * Add dummy tagger label to test * Fix formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-31 21:24:33 +02:00
Matthw Honnibal	c38298b8fa	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-31 19:55:55 +02:00
Matthw Honnibal	fe298fa50a	Shuffle on first epoch of train	2020-08-31 19:55:22 +02:00
Ines Montani	9af82f3f11	Merge pull request #6003 from explosion/feature/matcher-as-spans	2020-08-31 17:50:56 +02:00
Ines Montani	add9de5487	Deprecate (Phrase)Matcher.pipe	2020-08-31 17:01:24 +02:00
Ines Montani	83aff38c59	Make argument keyword-only Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-31 15:39:03 +02:00
Ines Montani	6340d1c63d	Add as_spans to Matcher/PhraseMatcher	2020-08-31 14:53:22 +02:00
svlandeg	13ee742fb4	example of custom logger	2020-08-31 14:24:41 +02:00
svlandeg	c18eb63483	Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs # Conflicts: # website/docs/usage/embeddings-transformers.md	2020-08-31 13:21:36 +02:00
Sofie Van Landeghem	ec14744ee4	Rename Transformer listener (#6001) * rename to spacy-transformers.TransformerListener * add some more tok2vec tests * use select_pipes * fix docs - annotation setter was not changed in the end	2020-08-31 12:41:39 +02:00
Adriane Boyd	216efaf5f5	Restrict tokenizer exceptions to ORTH and NORM	2020-08-31 09:55:01 +02:00
Matthew Honnibal	9341cbc013	Set version to v3.0.0a13	2020-08-30 23:10:43 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Ines Montani	34146750d4	Use frozen list with custom errors We don't want to break backwards compatibility too much but we also want to provide the best possible UX	2020-08-29 15:20:11 +02:00
Ines Montani	744f432420	Merge pull request #5994 from explosion/feature/idempotent-component-decorator	2020-08-29 13:17:13 +02:00
Ines Montani	5de3f8604d	Update spacy/util.py Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-29 13:17:06 +02:00
Ines Montani	091a9b522a	Remove unused variable [ci skip]	2020-08-29 13:11:26 +02:00
Ines Montani	2bc31e15c9	Tidy up and auto-format [ci skip]	2020-08-29 13:01:10 +02:00
Ines Montani	6520d1a1df	Work around set order in Language.disabled	2020-08-29 12:58:22 +02:00
Ines Montani	f45095a666	Merge pull request #5995 from adrianeboyd/bugfix/attribute-ruler-bugfixes	2020-08-29 12:38:30 +02:00
Ines Montani	e0b4984aa4	Make deprecated disable_pipes call into select_pipes	2020-08-29 12:08:46 +02:00
Ines Montani	15d73f4dc3	Make user-facing Language.disabled return list More consistent with all the other properties	2020-08-29 12:08:33 +02:00
Matthew Honnibal	58f19421b1	Return empty batch from tok2vec listener if no doc.tensor	2020-08-29 03:46:50 +02:00
svlandeg	5230529de2	add loggers registry & logger docs sections	2020-08-28 21:44:04 +02:00
Ines Montani	0687d7148e	Rename user-facing API	2020-08-28 21:04:02 +02:00
Adriane Boyd	0104bd1600	Sort the AttributeRuler matches by rule order Sort the returned matches by rule order (the `match_id`) so that the rules are applied in the order they were added. This is necessary, for instance, if the `AttributeRuler` is used for the tag map and later rules require POS tags.	2020-08-28 21:01:06 +02:00
Ines Montani	6a999c9303	Remove outdated component attr check	2020-08-28 20:59:19 +02:00
Adriane Boyd	8674b17651	Serialize AttributeRuler.patterns Serialize `AttributeRuler.patterns` instead of the individual lists to simplify the serialization and so that patterns are reloaded exactly as they were originally provided (preserving `_attrs_unnormed`).	2020-08-28 20:44:45 +02:00
Ines Montani	10da74382f	Raise if disabled components are removed before DisabledPipes.restore	2020-08-28 20:35:26 +02:00
Ines Montani	1e0363290e	Remove todos and update docstrings	2020-08-28 20:34:46 +02:00
Ines Montani	cad988da7f	Allow component decorators to re-run with same function	2020-08-28 16:27:22 +02:00
Ines Montani	3ce5be4b76	Allow loaded but disabled components	2020-08-28 15:20:14 +02:00
Ines Montani	89f692bc8a	Merge pull request #5992 from svlandeg/feature/wandb-restrict-config	2020-08-28 15:05:29 +02:00
Ines Montani	9c4049b57f	Merge pull request #5986 from explosion/fix/language-config-interpolate-disk-bytes	2020-08-28 15:03:52 +02:00
Ines Montani	adc050cdc5	Fix code style in test [ci skip]	2020-08-28 15:03:21 +02:00
svlandeg	05a1bafa15	fix type	2020-08-28 14:08:33 +02:00
svlandeg	33883aa764	rename field	2020-08-28 14:06:23 +02:00
svlandeg	1d8c4070aa	add disable_fields to wandb_logger	2020-08-28 13:55:32 +02:00
Ines Montani	a51b4f3a19	Merge branch 'develop' into fix/language-config-interpolate-disk-bytes	2020-08-28 13:21:17 +02:00
Ines Montani	03dde511b4	Merge pull request #5987 from explosion/feature/debug-config [ci skip]	2020-08-28 11:30:18 +02:00
Ines Montani	62e9967228	Merge branch 'develop' into fix/language-config-interpolate-disk-bytes	2020-08-28 11:19:36 +02:00
Ines Montani	4ca2698f85	Merge branch 'develop' into feature/debug-config	2020-08-28 11:19:17 +02:00
svlandeg	9a8255ffd5	two tests because of different exit type	2020-08-28 10:50:26 +02:00
svlandeg	73baaf330a	update error type	2020-08-28 10:46:21 +02:00
Matthew Honnibal	c558ca4485	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-27 19:47:26 +02:00
Matthew Honnibal	d3ffe4ca63	Fix error when tagger was initialized with no labels	2020-08-27 18:56:58 +02:00
Ines Montani	d1780db6a4	Tidy up and use different error [ci skip]	2020-08-27 18:56:55 +02:00
Ines Montani	ff4175e839	Add more info to debug config	2020-08-27 18:17:58 +02:00
Ines Montani	daac8ebacd	Don't interpolate config on Language deserialization	2020-08-27 16:44:36 +02:00
Matthew Honnibal	e1e1760fd6	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-27 03:22:11 +02:00
Matthew Honnibal	95adb58f15	Force tagger to pass batch of docs into model in begin_training	2020-08-27 03:21:03 +02:00
Ines Montani	cdc114e212	Merge pull request #5977 from explosion/refactor/vector-names	2020-08-26 19:03:16 +02:00
Ines Montani	8692d176f6	Merge pull request #5978 from explosion/feature/update-wasabi Update wasabi: new diff_strings and MarkdownRenderer	2020-08-26 19:02:52 +02:00
Matthew Honnibal	9b22714a4e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-26 15:48:45 +02:00
Matthew Honnibal	172af24f95	Fix upload and download	2020-08-26 15:48:23 +02:00
Ines Montani	a5fff1df51	Remove outdated non-empty output dir warning [ci skip]	2020-08-26 15:45:51 +02:00
Matthew Honnibal	2d520d3b45	Remove unused error	2020-08-26 15:41:14 +02:00
Adriane Boyd	90d88729e0	Add AttributeRuler.score (#5963) * Add AttributeRuler.score Add scoring for TAG / POS / MORPH / LEMMA if these are present in the assigned token attributes. Add default score weights (that don't really make a lot of sense) so that the scores are in the default config in some form. * Update docs	2020-08-26 15:39:30 +02:00
Ines Montani	3aec98ca38	Update wasabi: new diff_strings and MarkdownRenderer	2020-08-26 15:33:11 +02:00
Sofie Van Landeghem	79d460e3a2	Weights & Biases logger for train CLI (#5971) * quick test as part of train script * train_logger in config, default ConsoleLogger in loggers catalogue * entitiy typo * add wandb_logger * cleanup * Update spacy/cli/train_logger.py Co-authored-by: Ines Montani <ines@ines.io> * move loggers to gold.loggers Co-authored-by: Ines Montani <ines@ines.io>	2020-08-26 15:24:33 +02:00
Ines Montani	0997c30b9e	Merge pull request #5974 from explosion/feature/project-document	2020-08-26 15:14:13 +02:00
Matthew Honnibal	191fb4144f	Merge branch 'develop' into refactor/vector-names	2020-08-26 14:26:45 +02:00
Ines Montani	627617a079	Tidy up and add docs [ci skip]	2020-08-26 13:24:55 +02:00
Adriane Boyd	43c61da209	Set macro AUC score in Scorer.score_cats	2020-08-26 10:49:30 +02:00
Ines Montani	aeebc6678d	Small cleanup and adjustments	2020-08-26 10:26:57 +02:00
Ines Montani	31567d1e42	Link project.yml	2020-08-26 10:26:32 +02:00
Ines Montani	6c2a5ff53b	Auto-link local sources	2020-08-26 10:26:06 +02:00
Matthew Honnibal	77852d2428	Fix run_command for python 3.6	2020-08-26 05:02:43 +02:00
Matthew Honnibal	884cac5fb5	Make run_command backwards compatible	2020-08-26 04:33:42 +02:00
Matthew Honnibal	6547472347	Set version to v3.0.0a12	2020-08-26 04:02:34 +02:00
Adriane Boyd	7d7b65ffd4	Fix raw strings in URL pattern (#5972) Add missing raw string specifiers.	2020-08-26 04:00:49 +02:00
Matthew Honnibal	2771e4f2b3	Fix the git "sparse checkout" functionality (#5973) * Fix the git sparse checkout functionality * Format	2020-08-26 04:00:14 +02:00
Ines Montani	1c958a76c1	Add comment markers to only replace auto-generated docs	2020-08-26 00:03:06 +02:00
Ines Montani	f10989e8c4	Add "project document" and more project.yml meta fields	2020-08-25 17:14:27 +02:00
Ines Montani	fdcaf86c54	Adjust docstring End sentence earlier so it's shown as a full sentence in --help	2020-08-25 17:13:50 +02:00
Ines Montani	b89f6fa011	Fix meta defaults and error in package command	2020-08-25 17:13:33 +02:00
Ines Montani	94705c21c8	Allow reuse on validators to prevent reload error Otherwise this will cause an error if spaCy is live reloaded, e.g. in Streamlit	2020-08-25 17:13:11 +02:00
Matthew Honnibal	4f82a02b70	Remove 'fix_pretrained_vectors_name' hack	2020-08-25 14:37:45 +02:00
Adriane Boyd	0bab7c8b91	Remove PRON_LEMMA symbol (#5968)	2020-08-25 14:21:29 +02:00
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Ines Montani	dd84577a98	Update CLI utils, project.yml schema and add test	2020-08-25 11:54:53 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
Matthew Honnibal	ef43152af4	Update scorer	2020-08-25 02:42:47 +02:00
Matthew Honnibal	8d6e1ce306	Update v3.0.0a11	2020-08-25 00:32:08 +02:00
Matthew Honnibal	8038b87f04	Various small tweaks to project CLI (#5965) * Fix up/download of http and local paths * Support git_sparse_checkout for assets * Fix scorer * Handle already-present directories for git assets * Improve convert command * Fix support for existing files in git assets * Support branches in git sparse checkout * Format * Fix git assets * Document git block in assets * Fix test * Fix test * Revert "Fix test" This reverts commit `cf3097260f`. * Revert "Fix test" This reverts commit `964d636e27`. * Don't multiply p/r/f by 100 * Display scores * 100 during training	2020-08-25 00:30:52 +02:00
Adriane Boyd	abd3f2b65a	Rename Polish lemmatizer method (#5960) Rename Polish lemmatizer method to `pos_lookup` to distinguish it from pure token-based lookup methods.	2020-08-25 00:22:27 +02:00
Ines Montani	e12b03358b	Support removing extra values in fill-config (#5966) * Support removing extra values in fill-config * Fix test	2020-08-24 22:53:47 +02:00
Matthew Honnibal	f232d8db96	Report p/r/f out of 100	2020-08-24 17:17:23 +02:00
Ines Montani	0e7f99da58	Fix handling of optional [pretraining] block (#5954) * Fix handling of optional [pretraining] block * Remove pretraining from default config * Fix test * Add schema option for empty pretrain block	2020-08-24 15:56:03 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extensions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
Matthew Honnibal	64df37643f	Update lockfile after project pull	2020-08-24 03:27:09 +02:00
Matthew Honnibal	588c28fe45	Fix project pull when deps missing	2020-08-24 01:23:36 +02:00
Matthew Honnibal	001546c19e	Set version to v3.0.0a10	2020-08-23 21:15:38 +02:00
Matthew Honnibal	160a855246	Format	2020-08-23 21:15:12 +02:00
Matthew Honnibal	89f5b8abb3	Fix project push	2020-08-23 21:14:44 +02:00
Matthew Honnibal	3828bc3ed0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-23 18:32:24 +02:00
Matthew Honnibal	e559867605	Allow spacy project to push and pull to/from remote storage (#5949) * Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Update util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2020-08-23 18:32:09 +02:00
Matthew Honnibal	fe1cf7e124	Allow score_weights to list extra scores	2020-08-23 18:31:30 +02:00
Ines Montani	9bdc9e81f5	Fix error message [ci skip]	2020-08-23 12:14:02 +02:00
Sofie Van Landeghem	56eabcb2f2	Adding num_like test for Czech (#5946) * Create lex_attrs.py Hello, I am missing a CZECH language in spaCy. So I would like to help to push it a little. This file is based on other lex_attrs.py files, just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py * add like_num testing for czech Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com> Co-authored-by: holubvl3 <vilemrousi@gmail.com> Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 17:06:33 +02:00
holubvl3	a341b4ef09	Adding support for Czech language (#5826) * Create lex_attrs.py Hello, I am missing a CZECH language in spaCy. So I would like to help to push it a little. This file is based on other lex_attrs.py files, just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 16:17:53 +02:00
svlandeg	af36d77d01	fix typo in docstring	2020-08-21 15:56:03 +02:00
svlandeg	3060e4ae65	Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs # Conflicts: # website/src/widgets/quickstart-training-generator.js	2020-08-21 15:16:30 +02:00
svlandeg	cc926267f8	small fixes	2020-08-21 15:05:40 +02:00
Ines Montani	aa6a7cd6e7	Update docs and consistency [ci skip]	2020-08-21 13:49:18 +02:00
Ines Montani	3826cfb8fe	Merge pull request #5930 from svlandeg/feature/init-config-fix UX for init config	2020-08-21 12:06:33 +02:00
Ines Montani	79af7dcd6d	Small wording adjustments [ci skip]	2020-08-21 12:06:19 +02:00
Ines Montani	e60442d83a	Adjust label casing in displaCy NER visualizer (resolves #4866) - Accept any case for label names in ents and colors option, even if actual predicted label uses different casing - Don't text-transform: uppercase visually, if it's important to users that the label is represented as-is in the UI	2020-08-21 11:51:31 +02:00
Matthew Honnibal	c356e62908	Minor adjustments to quickstart template	2020-08-21 00:10:21 +02:00
Ines Montani	6ad59d59fe	Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip]	2020-08-20 11:20:58 +02:00
Sofie Van Landeghem	071c09ff35	add coding (#5942)	2020-08-20 11:08:38 +02:00
Ines Montani	ea6640ea72	Merge pull request #5939 from explosion/feature/thinc-v8.0.0a28 Update Thinc and config variables	2020-08-19 21:14:36 +02:00
Ines Montani	3dd390b1a1	Update Thinc and config variables	2020-08-19 19:46:12 +02:00
svlandeg	b96cd9fa5e	fix typo	2020-08-19 18:46:08 +02:00
Ines Montani	e2f2ef3a5a	Update init config and recommendations - As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations - Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date. - Simplify jinja_to_js script and use fewer dependencies	2020-08-19 13:33:15 +02:00
Ines Montani	2285e59765	Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip]	2020-08-19 11:29:02 +02:00
Matthew Honnibal	c0f6e77a41	Set version to v3.0.0a8	2020-08-18 23:29:00 +02:00
svlandeg	a8acedd4ba	example of custom reader and batcher	2020-08-18 19:15:16 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Sofie Van Landeghem	688e77562b	Train CLI script fixes (#5931) * fix dash replacement in overrides arguments * perform interpolation on training config * make sure only .spacy files are read	2020-08-18 16:06:37 +02:00
Ines Montani	82f0e20318	Update docs and consistency [ci skip]	2020-08-18 14:39:40 +02:00
svlandeg	10e67b400c	output_file required, spacy-transformers preferred instead of required	2020-08-18 13:38:43 +02:00
Ines Montani	1c3bcfb488	Update docs and util consistency	2020-08-18 01:22:59 +02:00
Ines Montani	990c6b4c32	Update docs and CLI [ci skip]	2020-08-17 21:38:20 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Matthew Honnibal	a95a36ce2a	Set version to v3.0.0a7	2020-08-16 15:51:05 +02:00
Ines Montani	6ae83bde0c	Fix CLI consistency [ci skip]	2020-08-16 15:46:29 +02:00
Ines Montani	45f13cbf64	Merge pull request #5916 from explosion/feature/new-thinc-config	2020-08-16 15:24:12 +02:00
Ines Montani	34bda91695	Show warnings if there's nothing to auto-fill	2020-08-16 14:19:43 +02:00
Ines Montani	dd5804d499	Update type hints	2020-08-16 14:19:33 +02:00
Ines Montani	a570c304df	Update quickstart, template and docs	2020-08-15 14:50:29 +02:00
Ines Montani	3272a63430	Merge pull request #5920 from explosion/fix/logging-warning-various	2020-08-15 14:41:15 +02:00
Ines Montani	fdcde9b0bf	Add init fill-config	2020-08-14 16:49:26 +02:00
Matthew Honnibal	9ebf39fb5f	Relax test	2020-08-14 16:31:09 +02:00
Ines Montani	8128e5eb35	Replace lexeme_norm warning with logging	2020-08-14 15:00:52 +02:00
Ines Montani	37814b608d	Remove env_opt and simplify default Optimizer	2020-08-14 14:59:54 +02:00
Ines Montani	ab1d165bba	Pass optimizer defined in config to resume/begin_training Otherwise, this would create a default optimizer, which isn't what we want?	2020-08-14 14:59:22 +02:00
Ines Montani	e4d0990857	Only receive from listener if listener exists	2020-08-14 14:58:48 +02:00
Ines Montani	cef97e4b63	Fix path check	2020-08-14 14:58:18 +02:00
Ines Montani	db2dbc8e59	Remove unused warning	2020-08-14 14:58:03 +02:00
Ines Montani	67cc39af7f	Update Thinc and include section order	2020-08-14 14:06:22 +02:00
Ines Montani	88b0a96801	Update for new Thinc and adjust config	2020-08-13 17:38:30 +02:00
Adam Bittlingmayer	7b33b2854f	Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910) * Add Armenian sentence-final verchaket * Add Greek and Arabic question marks, and contributor agreement * Check box	2020-08-12 15:36:14 +02:00
graue70	49e690bde1	Fix typos in comments (#5904) * Fix typo in comment * Fix typo * Add spaCy Contributor Agreement	2020-08-12 15:35:25 +02:00
graue70	ba84371ab0	Use init parameter (#5909)	2020-08-11 23:41:58 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	f79e4c094d	Remove generic type Seems to cause error on Python 3.8 with Cython?	2020-08-10 17:24:30 +02:00
Ines Montani	c099f6eece	Add Token.lex	2020-08-10 16:43:52 +02:00
Ines Montani	933a7cf8d1	Fix Lexeme.from_ptr	2020-08-10 16:43:37 +02:00
Ines Montani	64f2f84098	Update docstrings and docs [ci skip]	2020-08-10 13:45:22 +02:00
Ines Montani	a4b448eec4	Remove unused compiler flag	2020-08-10 13:13:18 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Ines Montani	d5c78c7a34	Update docs and fix consistency	2020-08-09 22:31:52 +02:00
Ines Montani	7c6854d8d4	Fix missing imports	2020-08-09 22:28:29 +02:00
Matthew Honnibal	0fc13b2f14	Set version to v3.0.0a6	2020-08-09 21:53:32 +02:00
Ines Montani	a15c5fb191	Update docstrings and docs	2020-08-09 16:10:48 +02:00
Ines Montani	8d2baa153d	Update tokenizer docs and add test	2020-08-09 15:24:01 +02:00
Matthew Honnibal	134d933d67	Add docstring for entity linker factory	2020-08-09 15:19:28 +02:00
Matthew Honnibal	992ee1c02f	Update tagger docstring	2020-08-09 15:09:31 +02:00
Matthew Honnibal	ebf9a7acbf	Add textcat docstring	2020-08-09 15:07:09 +02:00
Matthew Honnibal	8a13f510d6	Update tests	2020-08-09 15:01:16 +02:00
Matthew Honnibal	bbd8acd4bf	Add docstrings for parser and NER. Simplify some arguments	2020-08-09 14:46:13 +02:00
Matthew Honnibal	39a3d64c01	Add docstrings for Tok2Vec component	2020-08-09 00:48:03 +02:00
Ines Montani	fd20f84927	Merge pull request #5895 from explosion/docs/batchers Draft docstrings for batchers	2020-08-07 20:07:10 +02:00
Matthew Honnibal	f5c4e0b751	Add docstrings for batchers	2020-08-07 18:51:02 +02:00
Ines Montani	fe29ceec9e	Merge branch 'develop' into docs/model-docstrings	2020-08-07 18:42:01 +02:00
Ines Montani	3a193eb8f1	Fix imports, types and default configs	2020-08-07 18:40:54 +02:00
Matthew Honnibal	b1d83fc13e	Fix imports	2020-08-07 16:55:54 +02:00
Matthew Honnibal	473504d837	Format	2020-08-07 16:49:00 +02:00
Matthew Honnibal	234c52a91e	Add tok2vec docstrings	2020-08-07 16:48:48 +02:00
Matthew Honnibal	547bc8a82b	Add docstring notes	2020-08-07 16:17:34 +02:00
Ines Montani	6f3649923c	Merge pull request #5893 from explosion/feature/validate-arg	2020-08-07 15:47:20 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Matthew Honnibal	da6e59519e	Add docstrings for simple_ner	2020-08-07 15:09:49 +02:00
Matthew Honnibal	7ef8a64df9	Add docstring for parser	2020-08-07 14:59:34 +02:00
Ines Montani	fc9a4fe827	Update attribute ruler	2020-08-07 14:43:55 +02:00
Ines Montani	a8404c3517	validation -> validate	2020-08-07 14:43:47 +02:00
Ines Montani	1d01d89b79	Update CLI docs and evaluate command [ci skip]	2020-08-07 14:40:58 +02:00
Ines Montani	ef2c67cca5	Add DocBin to/from_disk methods and update docs (#5892) * Add DocBin to/from_disk methods and update docs * Use DocBin.from_disk in Corpus	2020-08-07 14:30:59 +02:00
Ines Montani	4ca08c6d5d	Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api Add AttributeRuler API docs	2020-08-07 13:55:12 +02:00
Adriane Boyd	b8d0c23857	Add AttributeRuler API docs With additional minor updates to AttributeRuler docstrings.	2020-08-07 12:43:23 +02:00
svlandeg	b17db0e994	Merge remote-tracking branch 'upstream/develop' into feature/el-docs # Conflicts: # website/docs/usage/training.md	2020-08-06 19:48:52 +02:00
Adriane Boyd	06c3a5e048	Add pipe to AttributeRuler (#5889)	2020-08-06 19:43:09 +02:00
Ines Montani	9b7f198390	Fix format	2020-08-06 19:30:53 +02:00
Ines Montani	3c4389110d	Remove unused imports	2020-08-06 19:30:47 +02:00
Matthew Honnibal	d4525816ef	Be less choosy about reporting textcat scores (#5879) * Set textcat scores more consistently * Refactor textcat scores * Fixes to scorer * Add comments * Add threshold * Rename just 'f' to micro_f in textcat scorer * Fix textcat score for two-class * Fix syntax * Fix textcat score * Fix docstring	2020-08-06 16:24:13 +02:00
svlandeg	0b4d1e1bc4	'debug data' instead of 'debug-data'	2020-08-06 15:47:31 +02:00
svlandeg	881e3f8fd0	add docbin explanation and example	2020-08-06 15:29:44 +02:00
Adriane Boyd	5e683a6e46	Fix return values for per feat score (#5885) * Fix return values for per feat score Convert `PRFScore` to dict as other per type scores. * Update tests accordingly	2020-08-06 15:14:47 +02:00
Ines Montani	913d21f0a3	Merge pull request #5882 from explosion/feature/raise-from Use "raise ... from" in custom errors for better tracebacks	2020-08-06 00:35:26 +02:00
Ines Montani	06e80d95cd	Sync develop with nightly docs state (#5883) Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-08-06 00:28:14 +02:00
Ines Montani	d92954ac1d	Merge pull request #5881 from explosion/feature/better-error-model-shortcuts	2020-08-06 00:13:35 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880)	2020-08-05 23:35:09 +02:00
Ines Montani	0881455a5d	Update error message	2020-08-05 23:15:05 +02:00
Ines Montani	2a1fa86a0d	Add better error for failed model shortcut loading	2020-08-05 23:10:29 +02:00
Ines Montani	c675746ca2	Update docstrings and types	2020-08-05 20:29:46 +02:00
Ines Montani	823e533dc1	Add config callbacks for modifying nlp object before and after init (#5866) * WIP: Concept for modifying nlp object before and after init * Make callbacks return nlp object Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Raise if callbacks don't return correct type * Rename, update types, add after_pipeline_creation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-05 19:47:54 +02:00
Ines Montani	586d695775	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-05 16:01:11 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Matthew Honnibal	50c0e49741	Fix train CLI	2020-08-05 15:40:47 +02:00
Matthew Honnibal	b9df4d6116	Fix textcat.begin_training if vectors set	2020-08-05 15:40:36 +02:00
Adriane Boyd	4193402c47	Add warning when Matcher subpattern is discarded (#5873) * Add a warning when a subpattern is not processed and discarded * Normalize subpattern attribute/operator keys to upper case like top-level attributes	2020-08-05 14:56:14 +02:00
Adriane Boyd	af125875cf	Update SimpleNER (#5878) * Fix `get_loss` to use NER annotation * Add labels as part of cfg * Add simple overfitting test	2020-08-05 14:43:29 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Sofie Van Landeghem	34873c4911	Example Dict format consistency (#5858) * consistently use upper-case IDS in token_annotation format and for get_aligned * remove ID from to_dict (not used in from_dict either) * fix test Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 22:22:26 +02:00
Adriane Boyd	fa79a0db9f	Add AttributeRuler for token attribute exceptions (#5842) * Add AttributeRuler for token attribute exceptions Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes. Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns. There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict. * Fix default name * Update name in error message * Extend AttributeRuler functionality * Add option to initialize with a dict of AttributeRuler patterns * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?) * Improve types * Sort spans before processing * Fix index boundaries in Span * Refactor retokenizer to separate attrs methods Add top-level `normalize_token_attrs` and `set_token_attrs` methods. * Update AttributeRuler to use refactored methods Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token. * Update spacy/pipeline/attributeruler.py Co-authored-by: Ines Montani <ines@ines.io> * Make API more similar to EntityRuler * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts * Return list of dicts as property `AttributeRuler.patterns` * Make attrs_unnormed private * Add test loading patterns from assets * Revert "Fix index boundaries in Span" This reverts commit `8f8a5c3386`. * Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 17:02:39 +02:00
Sofie Van Landeghem	492d1ec5de	Prevent alignment when texts don't match (#5867) * remove empty gold.pyx * add alignment unit test (to be used in docs) * ensure that Alignment is only used on equal texts * additional test using example.alignment * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 16:29:18 +02:00
Matthew Honnibal	ecb3c4e8f4	Create corpus iterator and batcher from registry during training (#5865) * Move batchers into their own module (and registry) * Update CLI * Update Corpus and batcher * Update tests * Update one config * Merge 'evaluation' block back under [training] * Import batchers in gold __init__ * Fix batchers * Update config * Update schema * Update util * Don't assume train and dev are actually paths * Update onto-joint config * Fix missing import * Format * Format * Update spacy/gold/corpus.py Co-authored-by: Ines Montani <ines@ines.io> * Fix name * Update default config * Fix get_length option in batchers * Update test * Add comment * Pass path into Corpus * Update docstring * Update schema and configs * Update config * Fix test * Fix paths * Fix print * Fix create_train_batches * [training.read_train] -> [training.train_corpus] * Update onto-joint config Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 15:09:37 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Adriane Boyd	b7e3018d97	Recalculate alignment if tokenization differs (#5868) * Recalculate alignment if tokenization differs * Refactor cached alignment data	2020-08-04 14:31:32 +02:00
Adriane Boyd	c62fd878a3	Allow Doc.char_span to snap to token boundaries (#5849) * Allow Doc.char_span to snap to token boundaries Add a `mode` option to allow `Doc.char_span` to snap to token boundaries. The `mode` options: * `strict`: character offsets must match token boundaries (default, same as before) * `inside`: all tokens completely within the character span * `outside`: all tokens at least partially covered by the character span Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching. * Remove unused import * Rename mode to alignment_mode Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.	2020-08-04 13:36:32 +02:00
Adriane Boyd	b841248589	Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else	2020-08-04 13:35:25 +02:00
Adriane Boyd	cd59979ab4	Fix span boundary handling in Spanish noun_chunks (#5860)	2020-08-03 13:53:15 +02:00
Ines Montani	934447a611	Merge pull request #5855 from svlandeg/fix/cli-debug	2020-08-03 13:09:20 +02:00
Ines Montani	4c055f0aa7	Add init CLI and init config (#5854) * Add init CLI and init config draft * Improve config validation * Auto-format * Don't export anything in debug config * Update docs	2020-08-02 15:18:30 +02:00
svlandeg	6f4e46ee93	Merge remote-tracking branch 'upstream/develop' into fix/cli-debug # Conflicts: # pyproject.toml # requirements.txt # setup.cfg	2020-08-01 18:38:59 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	b68c53858c	Remove global	2020-07-31 18:37:58 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
svlandeg	9b719dfb1a	use divider in between steps	2020-07-31 18:06:48 +02:00
svlandeg	51ffc4a166	rename pipe_name to component	2020-07-31 17:58:55 +02:00
svlandeg	878327d38e	printing final predictions by default to False	2020-07-31 17:36:32 +02:00
Ines Montani	2d955fbf98	Fix linting [ci skip]	2020-07-31 17:05:28 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
svlandeg	cc2f58a1b0	use data_validation context manager	2020-07-31 16:49:42 +02:00
Adriane Boyd	ac14ce7c30	Prefer earlier spans in EntityRuler (#5843) Similar to #4414, update the sorting in EntityRuler to prefer the first span in overlapping spans.	2020-07-31 16:09:32 +02:00
svlandeg	5fa3235d06	set DATA_VALIDATION to False for debug_model (upgrade thinc)	2020-07-31 15:21:01 +02:00
svlandeg	08d3c36c20	bugfix in train CLI	2020-07-31 15:03:43 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Adriane Boyd	901801b33b	Fix default arguments in DependencyParser.score	2020-07-31 10:55:44 +02:00
Adriane Boyd	9d79916792	Merge branch 'develop' into feature/scorer-adjustments	2020-07-31 10:48:14 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00
svlandeg	0b23594953	pipe_name instead of section in debug_model	2020-07-30 20:06:28 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830)	2020-07-29 20:22:47 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Gustavo Zadrozny Leyendecker	90b958fd01	Fix on EntityRendered to support break lines (after last entity) (closes #5838)	2020-07-29 18:48:39 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Matthew Honnibal	a2d573c039	Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors	2020-07-29 14:56:27 +02:00
Matthew Honnibal	2af741d7e3	Fix train arg	2020-07-29 14:56:01 +02:00
Matthew Honnibal	c27309f839	Merge branch 'develop' into feature/vectors	2020-07-29 14:54:10 +02:00
Ines Montani	62266fb828	Fix broken type annotation	2020-07-29 14:49:49 +02:00
Matthew Honnibal	142b58be92	Fix import	2020-07-29 14:45:09 +02:00
Matthew Honnibal	c99a653070	Adjust textcat model	2020-07-29 14:38:15 +02:00
Matthew Honnibal	9e1b11dd81	Update vectors in textcat	2020-07-29 14:35:36 +02:00
Matthew Honnibal	105cf29967	Fix DocBin	2020-07-29 14:23:13 +02:00
Ines Montani	ff0bc05da8	Fix docstrings [ci skip]	2020-07-29 14:09:37 +02:00
Ines Montani	6e2623d3f8	Fix docstring [ci skip]	2020-07-29 14:08:05 +02:00
Ines Montani	8d56260d92	Fix docstrings [ci skip]	2020-07-29 14:07:13 +02:00
Ines Montani	80b18124d2	Fix docstring [ci skip]	2020-07-29 14:03:35 +02:00
Matthew Honnibal	f0cf4a2dca	Update tests	2020-07-29 14:01:14 +02:00
Matthew Honnibal	07b47eaac8	Update tok2vec layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	5ae8628571	Fix CharacterEmbed layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	97d3651574	Fix stray link_vectors_to_models call	2020-07-29 14:01:13 +02:00
Matthew Honnibal	c7d1ece3eb	Update tests	2020-07-29 14:01:13 +02:00
Matthew Honnibal	00de30bcc2	Update CharacterEmbed function	2020-07-29 14:01:12 +02:00
Matthew Honnibal	6a6b09bd32	Update morphologizer model	2020-07-29 14:01:12 +02:00
Matthew Honnibal	20e9098e3f	Update tests	2020-07-29 14:01:12 +02:00
Matthew Honnibal	c35d6282fc	Add previous HashEmbedCNN tok2vec to make transition easier	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Matthew Honnibal	0c17ea4c85	Format	2020-07-29 14:00:13 +02:00
Matthew Honnibal	2aff3c4b5a	Load vectors in 'spacy train'	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7852a68a75	Fix load_vectors_into_model function	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7299419fe4	Dont load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	30dd96c540	Load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	df95e2af64	Add load_vectors_into_model util	2020-07-29 14:00:12 +02:00
Matthew Honnibal	475d7c1c7c	Fix StaticVectors class	2020-07-29 14:00:11 +02:00
Matthew Honnibal	44d350dc94	Use spaCy's StaticVectors	2020-07-29 14:00:11 +02:00
Matthew Honnibal	acc64e138a	Add import	2020-07-29 14:00:11 +02:00
Matthew Honnibal	9987ea9e4d	Fix Tok2Vec begin_training	2020-07-29 14:00:10 +02:00
Matthew Honnibal	099e9331c5	Fix tok2vec	2020-07-29 14:00:10 +02:00
Matthew Honnibal	fe0cdcd461	Fixes	2020-07-29 14:00:09 +02:00
Matthew Honnibal	123f8b832d	Refactor Tok2Vec model	2020-07-29 14:00:09 +02:00
Matthew Honnibal	c6b4f63c7c	Remove obsolete function	2020-07-29 14:00:09 +02:00
Matthew Honnibal	9cc7262224	Draft StaticVectors layer	2020-07-29 14:00:09 +02:00
Matthew Honnibal	cb9654e98c	WIP on new StaticVectors	2020-07-29 14:00:09 +02:00
Ines Montani	e257e66ab9	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
Ines Montani	e0ffe36e79	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Adriane Boyd	191a12d75f	Fix score_weights typo in train CLI (#5835)	2020-07-29 11:04:12 +02:00
Adriane Boyd	0cddb0dbe9	Move timing into Language.evaluate (#5836) Move timing into `Language.evaluate` so that only the processing is timed, not processing + scoring. `Language.evaluate` returns `scores["speed"]` as words per second, which should be identical to how the speed was added to the scores previously. Also add the speed to the evaluate CLI output.	2020-07-29 11:02:31 +02:00
Adriane Boyd	c689ae8f0a	Fix types in Scorer	2020-07-29 10:40:30 +02:00
oculusrepairo	03ab518f28	Update examples.py (#5820) * Update examples.py adding factual sentences to the list * Add missing comma separators Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-29 10:28:56 +02:00
Ines Montani	7adffc5361	Remove unused schema	2020-07-28 23:12:47 +02:00
Ines Montani	e5d9eaf79c	Tidy up docstrings and arguments	2020-07-28 23:12:42 +02:00
Ines Montani	ac24adec73	Small adjustments to Scorer and docs	2020-07-28 21:39:42 +02:00
Ines Montani	2c7a32cf12	Remove unused methods	2020-07-28 16:50:02 +02:00
Ines Montani	ba22111ff4	Move error to Errors	2020-07-28 16:24:14 +02:00
Ines Montani	2748249217	Re-add meta["pipeline"] for now	2020-07-28 16:14:23 +02:00
Ines Montani	b83ead5bf5	Merge pull request #5824 from svlandeg/fix/textcat-v3	2020-07-28 15:04:25 +02:00
Ines Montani	06a97a8766	Support --opt=value format in CLI config overrides	2020-07-28 13:43:15 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
graue70	b97dbab998	Fix typo in unit tests (#5823)	2020-07-27 20:18:48 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
svlandeg	85b2dcfd67	cleanup	2020-07-27 17:54:44 +02:00
svlandeg	61068e0fb1	util function dot_to_object and corresponding unit test	2020-07-27 17:50:12 +02:00
Ines Montani	10b84e1e27	Add flag to toggle sdist creation on package [ci skip]	2020-07-27 16:52:23 +02:00
Adriane Boyd	34c92dfe63	Add missing Scorer imports	2020-07-27 15:08:51 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00
Adriane Boyd	baf19fd652	Update cats scoring to provide overall score * Provide top-level score as `attr_score` * Provide a description of the score as `attr_score_desc` * Provide all potential scores keys, setting unused keys to `None` * Update CLI evaluate accordingly	2020-07-27 12:26:10 +02:00
Adriane Boyd	f8cf378be9	Combine weights from multiple components Combine weights from multiple components for the same score.	2020-07-27 10:21:31 +02:00
Ines Montani	3d56a3f286	Make more args keyword-only	2020-07-27 00:27:53 +02:00
Matthew Honnibal	80271ac0ba	Update default config	2020-07-26 15:27:39 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	53d37da29a	Make sure @factories is removed from config	2020-07-26 15:11:24 +02:00
Ines Montani	4060c2d5a6	Fix test	2020-07-26 13:40:19 +02:00
Ines Montani	2470486543	Allow pipeline components to set default scores and weights	2020-07-26 13:18:43 +02:00
Ines Montani	787d066e22	Remove pipes.pyx Probably accidentally re-added in a merge?	2020-07-26 13:08:52 +02:00
Matthew Honnibal	520d25cb50	Add smart_open dependency to fetch project assets (#5812) * Use smart_open for project assets * Fix assets.py * Update pyproject.toml	2020-07-26 12:15:00 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
Matthew Honnibal	71242327b2	Set version to v3.0.0a5	2020-07-25 14:06:01 +02:00
Ines Montani	cdbd6ba912	Merge pull request #5798 from explosion/feature/language-data-config	2020-07-25 13:34:49 +02:00
Ines Montani	49f27a2a7b	Tidy up [ci skip]	2020-07-25 13:00:49 +02:00
Ines Montani	4a0a692875	Add missing lex_attr_getters (resolves #5806)	2020-07-25 12:55:18 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. * Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	c003d26b94	Tidy up	2020-07-25 12:21:37 +02:00
Ines Montani	a063a82c40	Tidy up __init__.py	2020-07-25 12:14:37 +02:00
Ines Montani	8d9d28eb8b	Re-add setting for vocab data and tidy up	2020-07-25 12:14:28 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	38f6ea7a78	Simplify language data and revert detailed configs	2020-07-24 14:50:26 +02:00
Adriane Boyd	656574a01a	Update Japanese tests (#5807) * Update POS tests to reflect current behavior (it is not entirely clear whether the AUX/VERB mapping is indeed the desired behavior?) * Switch to `from_config` initialization in subtoken test	2020-07-24 12:45:14 +02:00
Adriane Boyd	fdb8815ef5	Minor refactor for Morphology and MorphAnalysis (#5804) * `MorphAnalysis.get` returns only the field values * Move `_normalize_props` inside `Morphology` as `Morphology.normalize_attrs` and simplify * Simplify POS field detection/conversion * Convert all non-POS features to strings * `Morphology` returns an empty string for a missing morph to align with the FEATS string returned for an existing morph * Remove unused `list_to_feats`	2020-07-24 09:28:06 +02:00
Adriane Boyd	19dc42776a	Remove hard-coded GPU ID from pretrain (#5808)	2020-07-24 09:26:26 +02:00
Joshua Olson	6d4d5c074c	Mark Japanese documents as tagged. (#5803) Mark the document as tagged before returning it to the user from the JapaneseTokenizer. Fixes #5802	2020-07-23 08:57:01 +02:00
Ines Montani	87737a5a60	Tidy up	2020-07-23 00:16:23 +02:00
Ines Montani	a624ae0675	Remove POS, TAG and LEMMA from tokenizer exceptions	2020-07-22 23:09:01 +02:00
Ines Montani	14d7d46f89	Merge branch 'develop' into feature/language-data-config	2020-07-22 22:18:53 +02:00
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	7fc4dadd22	Fix typo	2020-07-22 20:27:22 +02:00
Ines Montani	d0c6d1efc5	@factories -> factory (#5801)	2020-07-22 17:29:31 +02:00
Ines Montani	2c5bb59909	Use consistent --gpu-id option name	2020-07-22 16:53:41 +02:00
Adriane Boyd	038ff1a811	Improve warnings around normalization tables (#5794) Provide more customized normalization table warnings when training a new model. Only suggest installing `spacy-lookups-data` if it's not already installed and it includes a table for this language (currently checked in a hard-coded list).	2020-07-22 16:04:58 +02:00
Adriane Boyd	bf24f7f672	Update invalid tag maps (#5796) * Remove copy of (old?) PTB tag map for: bn, eu * Remove unsupported features from: hy, pl, ro, ru	2020-07-22 16:02:51 +02:00
Ines Montani	0fcd352179	Remove omit_extra_lookups	2020-07-22 16:01:17 +02:00
Ines Montani	945f795a3e	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
Adriane Boyd	b84fd70cc3	Fix exceptions for Morphology.__reduce__ (#5792) Pickle exceptions in the MORPH_RULES format instead of the internal format after the recent `Morphology.__init__` changes.	2020-07-22 15:00:25 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	311d0bde29	Merge pull request #5788 from explosion/master-tmp	2020-07-20 15:39:24 +02:00
Ines Montani	d51db72e46	Remove Python 2 marker	2020-07-20 15:01:36 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Sofie Van Landeghem	c9da9605f7	Test suite clean up (#5781) * step_through tests: skip instead of xfail * test_empty_doc should be fixed with new Thinc version * remove outdated test (there are other misaligned tests now) * xfail reason * fix test according to french exceptions * clarified some skipped tests * skip ukranian test instead of xfail * skip instead of xfail * skip + reason instead of xfail * removed obsolete tests referring to removed "set_frozen" functionality * fix test 999 * remove unused AlignmentError * remove xfail where possible, skip otherwise * increment thinc release for empty_doc test	2020-07-20 14:49:54 +02:00
Sofie Van Landeghem	1b2ec94382	Hyphen infix (#5770) * infix split on hyphen when preceded by number * clean up * skip ukranian test instead of xfail	2020-07-20 14:48:51 +02:00
Adriane Boyd	ec819fc311	Provide default output for evaluate in CLI (#5784)	2020-07-20 14:42:46 +02:00
Ines Montani	cb65b36839	Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps	2020-07-19 15:15:34 +02:00
Ines Montani	fa3c98f8b3	Update train.py	2020-07-19 13:40:47 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	9ee1c54f40	Improve tag map initialization and updating (#5764) * Improve tag map initialization and updating Generalize tag map initialization and updating so that the tag map can be loaded correctly prior to loading a `Corpus` with `spacy debug-data` and `spacy train`. * normalize provided tag map as necessary * use the same method for initializing and updating the tag map * Replace rather than update tag map Replace rather than update tag map when loading a custom tag map. Updating the tag map is problematic due to the sorted list of tag names and the fact that the tag map will contain lingering/unwanted tags from the default tag map. * Update CLI scripts * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 13:13:57 +02:00

8520 Commits