spaCy

Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
Ines Montani	7f68f4bd92	Hide jsonl_loc on init vectors and tidy up [ci skip]	2020-10-01 16:44:17 +02:00
Ines Montani	0a8a124a6e	Update docs [ci skip]	2020-10-01 12:15:53 +02:00
Ines Montani	44160cd52f	Tidy up [ci skip]	2020-10-01 10:41:19 +02:00
Matthew Honnibal	59294e91aa	Restore the 'jsonl' arg for init vectors The lexemes.jsonl file is still used in our English vectors, and it may be required by users as well. I think it's worth supporting the option.	2020-09-30 19:06:50 +02:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Elijah Rippeth	4cbb954281	reorder so tagmap is replaced only if a custom file is provided. (#6164) * reorder so tagmap is replaced only if a custom file is provided. * Remove unneeded variable initialization Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-09-30 13:26:06 +02:00
Ines Montani	a5debb356d	Tidy up and adjust logging [ci skip]	2020-09-30 01:22:08 +02:00
Ines Montani	56a2f778c4	Add logging [ci skip]	2020-09-30 01:08:55 +02:00
Ines Montani	fe3f111c37	Merge pull request #6168 from explosion/fix/default-corpus-values	2020-09-30 00:24:02 +02:00
Ines Montani	ae51843468	Remove augmenter from jinja template [ci skip]	2020-09-29 23:08:50 +02:00
Ines Montani	9bb958fd0a	Fix debug data [ci skip]	2020-09-29 23:07:11 +02:00
Ines Montani	df8dd91b6f	Merge branch 'develop' into fix/default-corpus-values	2020-09-29 22:55:39 +02:00
Ines Montani	0a1ee109db	Remove init form path	2020-09-29 22:53:18 +02:00
Ines Montani	c334a7d45f	Remove	2020-09-29 22:38:39 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	0250bcf6a3	Show validation error during init	2020-09-29 22:29:09 +02:00
Ines Montani	43c92ec8c9	Resolve dir for better output [ci skip]	2020-09-29 22:01:04 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	604be54a5c	Support --code in evaluate CLI [ci skip]	2020-09-29 21:20:56 +02:00
Ines Montani	d3c63b7965	Merge branch 'develop' into feature/prepare	2020-09-29 20:53:05 +02:00
Ines Montani	2be80379ec	Fix small issues, resolve_dot_names and debug model	2020-09-29 20:38:35 +02:00
Ines Montani	71a0ee274a	Move init labels to init pipeline module	2020-09-29 18:09:33 +02:00
Ines Montani	534e1ef498	Fix template	2020-09-29 17:02:55 +02:00
Matthew Honnibal	10847c7f4e	Fix arg	2020-09-29 16:48:07 +02:00
Matthew Honnibal	e70a00fa76	Remove unnecessary warning from train	2020-09-29 16:47:54 +02:00
Matthew Honnibal	3f0d61232d	Remove outdated arg from train	2020-09-29 16:47:44 +02:00
Matthew Honnibal	e957d66b92	Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare	2020-09-29 16:22:53 +02:00
Matthew Honnibal	45daf5c9fe	Add init labels command	2020-09-29 16:22:37 +02:00
Ines Montani	aa2a6882d0	Fix logging	2020-09-29 16:08:39 +02:00
Sofie Van Landeghem	6a04e5adea	encoding UTF8 (#6161)	2020-09-29 14:49:55 +02:00
Ines Montani	4925ad760a	Add init vectors	2020-09-29 10:58:50 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	a139fe672b	Fix typos and refactor CLI logging	2020-09-28 21:17:10 +02:00
Ines Montani	2e9c9e74af	Fix config resolution and interpolation TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)	2020-09-28 15:34:00 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Ines Montani	a89e0ff7cb	Fix typo	2020-09-28 12:55:21 +02:00
Ines Montani	a62337b3f3	Tidy up vocab init	2020-09-28 12:53:06 +02:00
Ines Montani	c22ecc66bb	Don't support init path for now	2020-09-28 12:46:28 +02:00
Ines Montani	a5f2cc0509	Tidy up and remove raw text (rehearsal) for now	2020-09-28 12:30:13 +02:00
Ines Montani	1590de11b1	Update config	2020-09-28 12:05:23 +02:00
Ines Montani	e44a7519cd	Update CLI and add [initialize] block	2020-09-28 11:56:14 +02:00
Ines Montani	d5155376fd	Update vocab init	2020-09-28 11:30:18 +02:00
Ines Montani	8b74fd19df	init pipeline -> init nlp	2020-09-28 11:13:38 +02:00
Ines Montani	2fdb7285a0	Update CLI	2020-09-28 11:06:07 +02:00
Ines Montani	553bfea641	Fix commands	2020-09-28 10:53:17 +02:00
Matthew Honnibal	44bad1474c	Add init_pipeline file	2020-09-28 09:47:34 +02:00
Matthew Honnibal	b886f53c31	init-pipeline runs (maybe doesnt work)	2020-09-28 03:42:47 +02:00
Matthew Honnibal	ed2aff2db3	Remove unused train code	2020-09-28 03:12:31 +02:00
Matthew Honnibal	3a0a3b8db6	Dont hard-code for 'corpora' name	2020-09-28 03:06:33 +02:00
Matthew Honnibal	a976da168c	Support data augmentation in Corpus (#6155) * Support data augmentation in Corpus * Note initial docs for data augmentation * Add augmenter to quickstart * Fix flake8 * Format * Fix test * Update spacy/tests/training/test_training.py * Improve data augmentation arguments * Update templates * Move randomization out into caller * Refactor * Update spacy/training/augment.py * Update spacy/tests/training/test_training.py * Fix augment * Fix test	2020-09-28 03:03:27 +02:00
Matthew Honnibal	a3e1791c9c	Upd train	2020-09-28 01:08:30 +02:00
Matthew Honnibal	b5556093e2	Start updating train script	2020-09-27 23:59:44 +02:00
Ines Montani	e04bd16f7f	Merge branch 'develop' into feature/new-thinc-config-resolution	2020-09-27 22:34:46 +02:00
Ines Montani	d7ad65a9bb	Fix handling of error description [ci skip]	2020-09-27 22:31:57 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Matthew Honnibal	39b178999c	Tmp notes	2020-09-27 20:13:38 +02:00
Ines Montani	b4486d747d	Merge branch 'develop' into fix/train-config-interpolation	2020-09-26 15:32:14 +02:00
Ines Montani	b2d07de786	Construct nlp from uninterpolated config before training	2020-09-26 15:16:59 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Matthew Honnibal	3d8388969e	Sort paths for cache consistency	2020-09-25 19:07:26 +02:00
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Matthew Honnibal	74ee456374	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-24 16:11:47 +02:00
Matthew Honnibal	0bc214c102	Fix pull	2020-09-24 16:11:33 +02:00
Ines Montani	74e1f192b4	Merge pull request #6134 from explosion/feature/training_before_to_disk	2020-09-24 14:44:11 +02:00
Ines Montani	24e7ac3f2b	Fix download CLI [ci skip]	2020-09-24 14:43:56 +02:00
Ines Montani	88e54caa12	accuracy -> performance	2020-09-24 14:32:35 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Ines Montani	c6c67b606e	Merge pull request #6133 from explosion/fix/score_weights	2020-09-24 12:00:57 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Matthew Honnibal	17a6b0a173	Make project pull order insensitive (#6131)	2020-09-24 10:30:42 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
svlandeg	35dbc63578	Merge remote-tracking branch 'upstream/develop' into fix/nr_features # Conflicts: # spacy/ml/models/parser.py # spacy/tests/serialize/test_serialize_config.py # website/docs/api/architectures.md	2020-09-23 17:01:13 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	7745d77a38	Fix whitespace in template [ci skip]	2020-09-23 13:21:42 +02:00
svlandeg	6435458d51	simplify expression	2020-09-23 12:12:38 +02:00
svlandeg	20b0ec5dcf	avoid logging performance of frozen components	2020-09-23 10:37:12 +02:00
Ines Montani	6ca06cb62c	Update docs and formatting [ci skip]	2020-09-23 10:14:27 +02:00
Ines Montani	888f936a73	Merge pull request #6106 from svlandeg/feature/textcat-quickstart	2020-09-23 10:11:45 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
svlandeg	556f3e4652	add pooling to NEL's TransformerListener	2020-09-23 09:24:28 +02:00
Sofie Van Landeghem	86a08f819d	tok2vec.update instead of predict (#6113)	2020-09-22 21:54:52 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	085a1c8e2b	add no_output_layer to TextCatBOW config	2020-09-22 12:06:40 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
svlandeg	e931f4d757	add textcat score	2020-09-22 10:56:43 +02:00
svlandeg	396b33257f	add entity_linker to jinja template	2020-09-22 10:40:05 +02:00
svlandeg	135de82a2d	add textcat to quickstart	2020-09-22 10:22:06 +02:00
Ines Montani	6316d5f398	Improve messages in project CLI [ci skip]	2020-09-22 09:45:34 +02:00
Ines Montani	81606b29bd	Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]	2020-09-22 09:31:23 +02:00
svlandeg	45b29c4a5b	cleanup	2020-09-21 23:17:23 +02:00
svlandeg	fa5c416db6	initialize through nlp object and with train_corpus	2020-09-21 23:09:22 +02:00
svlandeg	447b3e5787	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 16:58:40 +02:00
Ines Montani	e8bcaa44f1	Don't auto-decompress archives with smart_open [ci skip]	2020-09-21 16:01:46 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Ines Montani	b2302c0a1c	Improve error for missing dependency	2020-09-20 17:44:51 +02:00
Matthew Honnibal	8fb59d958c	Format	2020-09-20 16:31:48 +02:00
Matthew Honnibal	dc22771f87	Fix sparse checkout	2020-09-20 16:30:05 +02:00
Matthew Honnibal	a0fb5e50db	Use simple git clone call if not sparse	2020-09-20 16:22:04 +02:00
Matthew Honnibal	2c24d633d0	Use updated run_command	2020-09-20 16:21:43 +02:00
Ines Montani	554c9a2497	Update docs [ci skip]	2020-09-20 12:30:53 +02:00
svlandeg	6db1d5dc0d	trying some stuff	2020-09-19 19:11:30 +02:00
Ines Montani	e863b3dc14	Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2	2020-09-19 12:33:38 +02:00
Sofie Van Landeghem	39872de1f6	Introducing the gpu_allocator (#6091) * rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator' * --code instead of --code-path * update documentation * avoid querying the "system" section directly * add explanation of gpu_allocator to TF/PyTorch section in docs * fix typo * fix typo 2 * use set_gpu_allocator from thinc 8.0.0a34 * default null instead of empty string	2020-09-19 01:17:02 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
svlandeg	e4fc7e0222	fixing output sample to proper 2D array	2020-09-17 22:34:36 +02:00
Ines Montani	3865214343	Use consistent shortcut	2020-09-17 16:57:02 +02:00
svlandeg	35a3931064	fix typo	2020-09-17 16:36:27 +02:00
svlandeg	ddfc1fc146	add pretraining option to init config	2020-09-17 16:05:40 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	51fa929f47	rewrite train_corpus to corpus.train in config	2020-09-15 21:58:04 +02:00
Ines Montani	9cc304c194	Merge pull request #6064 from explosion/fix/sparse-checkout-ux Fix sparse checkout and error handling	2020-09-15 00:32:20 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidential commit	2020-09-14 17:08:00 +02:00
Ines Montani	c052017025	Fix sparse checkout and error handling	2020-09-14 14:12:58 +02:00
Matthew Honnibal	54c40223a1	Improve v3 pretrain command (#6040) * Starts to run * Update pretrain script * Update corpus * Update pretrain schema * Remove outdated test * Make JsonlTexts produce Example objects.	2020-09-13 14:05:05 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
Ines Montani	a5633b205f	Fix handling of errors around git [ci skip]	2020-09-13 10:52:28 +02:00
Ines Montani	f8846c198d	Update types and docstrings	2020-09-13 10:52:02 +02:00
Matthew Honnibal	37347830d4	Fix reading in GloVe vectors	2020-09-12 17:31:18 +02:00
Ines Montani	b41be87213	Merge pull request #6051 from svlandeg/feature/cli-config	2020-09-12 17:12:35 +02:00
Ines Montani	eedaaaec75	Fix handling of existing asset without checksum [ci skip]	2020-09-12 17:02:53 +02:00
svlandeg	a75cfe0da6	Merge remote-tracking branch 'upstream/develop' into feature/cli-config	2020-09-12 14:44:40 +02:00
svlandeg	115147804a	string_to_list to parse comma-separated string into a list	2020-09-12 14:43:22 +02:00
Ines Montani	f886f5bbc8	Merge pull request #6048 from explosion/fix/clone-compat	2020-09-12 10:30:49 +02:00
Ines Montani	0b2e07215d	Support overwriting name on spacy package	2020-09-11 11:38:28 +02:00
svlandeg	5b94aeece9	support pipeline as "list in string"	2020-09-11 11:08:46 +02:00
Ines Montani	1bce432b4a	Adjust message [ci skip]	2020-09-11 10:00:49 +02:00
Ines Montani	5acd4fbcd8	Merge branch 'develop' into fix/clone-compat	2020-09-11 09:58:30 +02:00
Ines Montani	761bd60d43	Adjust info message	2020-09-11 09:57:00 +02:00
Ines Montani	6831161bfa	Resolve path to be extra sure	2020-09-11 09:56:49 +02:00
svlandeg	1723fb73c4	remove brol	2020-09-10 17:44:59 +02:00
svlandeg	08a831ce83	process trailing slash if any	2020-09-10 17:39:52 +02:00
Ines Montani	3e83a509bb	WIP: fix project clone compatibility	2020-09-10 15:49:13 +02:00
svlandeg	f1bc09c1e9	restore partly	2020-09-10 14:53:02 +02:00
svlandeg	3889747119	asset fix & UX	2020-09-10 14:36:53 +02:00
svlandeg	a36766d153	hookup branch	2020-09-10 12:00:34 +02:00
svlandeg	97d99f7efa	Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes	2020-09-10 11:51:34 +02:00
Ines Montani	908f3a4494	Update default projects repo [ci skip]	2020-09-10 11:42:14 +02:00
svlandeg	92f9d2f406	small UX fixes	2020-09-10 11:35:50 +02:00
svlandeg	1fc5486792	more fine-grained errors for git_sparse_checkout	2020-09-10 11:31:32 +02:00
Ines Montani	15bc3a37b4	Add --branch to project clone	2020-09-10 11:08:15 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Matthew Honnibal	ba5f4c9b32	Add words and seconds to train info	2020-09-08 15:24:47 +02:00


1095 Commits