spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
Ines Montani	e8bcaa44f1	Don't auto-decompress archives with smart_open [ci skip]	2020-09-21 16:01:46 +02:00
Adriane Boyd	6aa91c7ca0	Make user_data keyword-only	2020-09-21 16:00:06 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Ines Montani	b2302c0a1c	Improve error for missing dependency	2020-09-20 17:44:51 +02:00
Matthew Honnibal	8fb59d958c	Format	2020-09-20 16:31:48 +02:00
Matthew Honnibal	dc22771f87	Fix sparse checkout	2020-09-20 16:30:05 +02:00
Matthew Honnibal	a0fb5e50db	Use simple git clone call if not sparse	2020-09-20 16:22:04 +02:00
Matthew Honnibal	2c24d633d0	Use updated run_command	2020-09-20 16:21:43 +02:00
Matthew Honnibal	889128e5c5	Improve error handling in run_command	2020-09-20 16:20:57 +02:00
Ines Montani	554c9a2497	Update docs [ci skip]	2020-09-20 12:30:53 +02:00
svlandeg	6db1d5dc0d	trying some stuff	2020-09-19 19:11:30 +02:00
Ines Montani	e863b3dc14	Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2	2020-09-19 12:33:38 +02:00
Sofie Van Landeghem	39872de1f6	Introducing the gpu_allocator (#6091) * rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator' * --code instead of --code-path * update documentation * avoid querying the "system" section directly * add explanation of gpu_allocator to TF/PyTorch section in docs * fix typo * fix typo 2 * use set_gpu_allocator from thinc 8.0.0a34 * default null instead of empty string	2020-09-19 01:17:02 +02:00
Adriane Boyd	47080fba98	Minor renaming / refactoring * Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message * Make `Vocab.lookups` a property	2020-09-18 19:43:19 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Matthew Honnibal	bbdb5f62b7	Temporary work-around for scoring a subset of components (#6090) * Try hacking the scorer to work around sentence boundaries * Upd scorer * Set dev version * Upd scorer hack * Fix version * Improve comment on hack	2020-09-18 14:26:42 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
svlandeg	e4fc7e0222	fixing output sample to proper 2D array	2020-09-17 22:34:36 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
Ines Montani	3865214343	Use consistent shortcut	2020-09-17 16:57:02 +02:00
svlandeg	35a3931064	fix typo	2020-09-17 16:36:27 +02:00
svlandeg	ddfc1fc146	add pretraining option to init config	2020-09-17 16:05:40 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	781fae678b	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
Matthew Honnibal	8303d101a5	Set version to v3.0.0a19	2020-09-17 00:18:49 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_from_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjusts tokens within a specified range rather than the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
Matthew Honnibal	c776594ab1	Fix	2020-09-16 18:15:14 +02:00
Matthew Honnibal	4a573d18b3	Add comment	2020-09-16 17:51:29 +02:00
Matthew Honnibal	d31afc8334	Fix Language.link_components when model is None	2020-09-16 17:49:48 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	d722a439aa	Remove unneeded methods in senter and morphologizer (#6074) Now that the tagger doesn't manage the tag map, the child classes senter and morphologizer don't need to override the serialization methods.	2020-09-16 17:39:41 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
svlandeg	1040e250d8	actual commit with test for custom readers with ml_datasets >= 0.2	2020-09-16 16:41:28 +02:00
svlandeg	714a5a05c6	test for custom readers with ml_datasets >= 0.2	2020-09-16 16:39:55 +02:00
svlandeg	0d1392340f	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-15 23:17:08 +02:00
svlandeg	f420aa1138	use e.value to get to the ExceptionInfo value	2020-09-15 22:30:09 +02:00
svlandeg	7336657662	corpus is a Dict	2020-09-15 22:07:16 +02:00
svlandeg	51fa929f47	rewrite train_corpus to corpus.train in config	2020-09-15 21:58:04 +02:00
svlandeg	bd87e8686e	move tests to correct subdir	2020-09-15 21:40:38 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	91a6637f74	Remove extra pipe config values before merging	2020-09-15 14:24:17 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	2ed6e2a218	Auto-format	2020-09-15 14:20:04 +02:00
Ines Montani	2214d1bb7b	Merge pull request #6067 from explosion/feature/spacy-blank-from-config	2020-09-15 14:18:33 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
svlandeg	7677e5c0e2	fix wandb logger when calling multiple times from same script	2020-09-15 12:56:33 +02:00
Ines Montani	eff9406718	Support vocab arg in spacy.blank	2020-09-15 11:39:36 +02:00
Ines Montani	99549a5ace	Fix consistency and update docs	2020-09-15 11:37:37 +02:00
Ines Montani	7dfc4bc062	Allow overriding meta from spacy.blank	2020-09-15 11:12:12 +02:00
Ines Montani	0f943157af	Delegate to Language.from_config in spacy.blank	2020-09-15 11:07:55 +02:00
Ines Montani	e977086a9a	Update default pretraining config [ci skip]	2020-09-15 01:12:02 +02:00
Ines Montani	154752f9c2	Update docs and consistency [ci skip]	2020-09-15 00:32:49 +02:00
Ines Montani	9cc304c194	Merge pull request #6064 from explosion/fix/sparse-checkout-ux Fix sparse checkout and error handling	2020-09-15 00:32:20 +02:00
Matthew Honnibal	475323cd36	Set version to v3.0.0a18	2020-09-14 22:05:43 +02:00
Matthew Honnibal	e8378b57bc	Fix test	2020-09-14 21:21:13 +02:00
Matthew Honnibal	adf0bab23a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-14 21:04:49 +02:00
Matthew Honnibal	ae15fa9688	Fix iob converter	2020-09-14 21:02:18 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidental commit	2020-09-14 17:08:00 +02:00
Ines Montani	c052017025	Fix sparse checkout and error handling	2020-09-14 14:12:58 +02:00
Matthew Honnibal	fdd2340f6c	Set version to v3.0.0a17	2020-09-13 23:52:03 +02:00
Ines Montani	416deb412f	Prevent duplicate traceback on CalledProcessError [ci skip]	2020-09-13 19:28:54 +02:00
Ines Montani	61a4ef0b46	Fix syntax error	2020-09-13 19:23:09 +02:00
Matthew Honnibal	b693d2d224	Fix speed report in table	2020-09-13 17:39:31 +02:00
Sofie Van Landeghem	744df9814a	define threshold for scoring textcat in TextCat config (#6055) * define threshold for scoring textcat in TextCat config * fix unit test and documentation	2020-09-13 14:15:52 +02:00
Adriane Boyd	ab270364f1	Modify Token.morph to enable unsetting (#6043) Modify `Token.morph` property so that `Token.c.morph` can be reset back to an internal value of `0`. Allow setting `Token.morph` from a hash as long as the morph string is already in the `StringStore`, setting it indirectly through `Token.morph_` so that the value is added to the morphology. If the hash is not in the `StringStore`, raise an error.	2020-09-13 14:06:07 +02:00
Adriane Boyd	c7bd631b5f	Fix token.idx for special cases with affixes (#6035)	2020-09-13 14:05:36 +02:00
Matthew Honnibal	54c40223a1	Improve v3 pretrain command (#6040) * Starts to run * Update pretrain script * Update corpus * Update pretrain schema * Remove outdated test * Make JsonlTexts produce Example objects.	2020-09-13 14:05:05 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
Ines Montani	a5633b205f	Fix handling of errors around git [ci skip]	2020-09-13 10:52:28 +02:00
Ines Montani	f8846c198d	Update types and docstrings	2020-09-13 10:52:02 +02:00
Sofie Van Landeghem	e92e850c72	Raise if empty examples (#6052) * raise error if no valid Example objects were found during initialization * fix max_length parameter * remove commit from other branch Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-12 21:01:53 +02:00
Matthew Honnibal	37347830d4	Fix reading in GloVe vectors	2020-09-12 17:31:18 +02:00
Ines Montani	b41be87213	Merge pull request #6051 from svlandeg/feature/cli-config	2020-09-12 17:12:35 +02:00
Ines Montani	eedaaaec75	Fix handling of existing asset without checksum [ci skip]	2020-09-12 17:02:53 +02:00
svlandeg	a75cfe0da6	Merge remote-tracking branch 'upstream/develop' into feature/cli-config	2020-09-12 14:44:40 +02:00
svlandeg	115147804a	string_to_list to parse comma-separated string into a list	2020-09-12 14:43:22 +02:00
Ines Montani	f886f5bbc8	Merge pull request #6048 from explosion/fix/clone-compat	2020-09-12 10:30:49 +02:00
svlandeg	711166a75a	prevent overwriting score_weights	2020-09-11 15:12:05 +02:00
Ines Montani	62eec33bc4	Fix meta.json validation	2020-09-11 11:38:33 +02:00
Ines Montani	0b2e07215d	Support overwriting name on spacy package	2020-09-11 11:38:28 +02:00
svlandeg	5b94aeece9	support pipeline as "list in string"	2020-09-11 11:08:46 +02:00
Ines Montani	1bce432b4a	Adjust message [ci skip]	2020-09-11 10:00:49 +02:00
Ines Montani	5acd4fbcd8	Merge branch 'develop' into fix/clone-compat	2020-09-11 09:58:30 +02:00
Ines Montani	761bd60d43	Adjust info message	2020-09-11 09:57:00 +02:00
Ines Montani	6831161bfa	Resolve path to be extra sure	2020-09-11 09:56:49 +02:00
svlandeg	1723fb73c4	remove brol	2020-09-10 17:44:59 +02:00
svlandeg	08a831ce83	process trailing slash if any	2020-09-10 17:39:52 +02:00
Ines Montani	3e83a509bb	WIP: fix project clone compatibility	2020-09-10 15:49:13 +02:00
svlandeg	f1bc09c1e9	restore partly	2020-09-10 14:53:02 +02:00
svlandeg	3889747119	asset fix & UX	2020-09-10 14:36:53 +02:00
svlandeg	a36766d153	hookup branch	2020-09-10 12:00:34 +02:00
svlandeg	97d99f7efa	Merge remote-tracking branch 'upstream/develop' into feature/doc-fixes	2020-09-10 11:51:34 +02:00
Ines Montani	908f3a4494	Update default projects repo [ci skip]	2020-09-10 11:42:14 +02:00
svlandeg	92f9d2f406	small UX fixes	2020-09-10 11:35:50 +02:00
svlandeg	1fc5486792	more fine-grained errors for git_sparse_checkout	2020-09-10 11:31:32 +02:00
Ines Montani	15bc3a37b4	Add --branch to project clone	2020-09-10 11:08:15 +02:00
Ines Montani	1955aaaa20	Merge pull request #6045 from svlandeg/feature/more-layers-docs [ci skip]	2020-09-09 21:46:40 +02:00
Sofie Van Landeghem	cb66ea7400	Remove simple_ner code (#6041) * remove simple_ner code * remove unused _biluo and _iob files	2020-09-09 16:11:27 +02:00
svlandeg	39aa740777	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-09 11:59:34 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
svlandeg	d0a8849e4d	fix typo	2020-09-08 18:32:12 +02:00
svlandeg	bd8f9b188b	small fixes	2020-09-08 17:24:36 +02:00
Matthew Honnibal	4b82882767	Fix defaults	2020-09-08 15:31:21 +02:00
Matthew Honnibal	5d09e3e154	Set version to v3.0.0a15	2020-09-08 15:25:10 +02:00
Matthew Honnibal	ba5f4c9b32	Add words and seconds to train info	2020-09-08 15:24:47 +02:00
Matthew Honnibal	b470062153	Add CLI registry (#6037)	2020-09-08 15:23:34 +02:00
svlandeg	06ef66fd73	Merge remote-tracking branch 'upstream/develop' into feature/more-layers-docs	2020-09-08 10:28:42 +02:00
Matthew Honnibal	dae22f3dfa	Fix ignoring of punct labels	2020-09-05 14:11:59 +02:00
Matthew Honnibal	12e1279f6b	Set version to v3.0.0a14	2020-09-05 04:13:53 +02:00
Matthew Honnibal	4b7abaafdb	Fix learn rate for non-transformer	2020-09-04 21:22:50 +02:00
Matthew Honnibal	465785a672	Fix project pull and push	2020-09-04 21:15:55 +02:00
Ines Montani	f174c7b1f3	Merge branch 'develop' into pr/6018	2020-09-04 15:54:49 +02:00
Ines Montani	f06eed800e	Merge pull request #6029 from explosion/master-tmp	2020-09-04 15:11:55 +02:00
Ines Montani	f9550b4493	Fix components in meta.json and website [ci skip]	2020-09-04 14:42:12 +02:00
Ines Montani	d7cc2ee72d	Fix tests	2020-09-04 14:05:55 +02:00
Ines Montani	90043a6f9b	Tidy up and auto-format	2020-09-04 13:42:33 +02:00
Ines Montani	df0b68f60e	Remove unicode declarations and update language data	2020-09-04 13:19:16 +02:00
Ines Montani	ba600f91c5	Tidy up imports	2020-09-04 13:15:44 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Adriane Boyd	b927893309	Merge branch 'develop' into feature/dependency-matcher-v3	2020-09-04 13:03:30 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
holubvl3	0a27fca557	Create examples.py (#5985) * Create examples.py * Create tag_map.py * Delete tag_map.py * Update examples.py formatting: add empty line Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-09-04 11:00:14 +02:00
Ines Montani	2189046869	Merge pull request #6024 from explosion/chore/registry-renaming	2020-09-04 10:54:10 +02:00
svlandeg	c32fcdf4c9	fix typo	2020-09-04 09:10:21 +02:00
Ines Montani	595f9dc2e4	Make displacy color registry consistent with others This was the only registry that expected the registered objects to be dictionaries instead of functions that return something. We can still support plain dicts but we should also support functions for consistency	2020-09-03 23:05:41 +02:00
Matthew Honnibal	1c07820681	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-03 18:54:21 +02:00
Matthew Honnibal	7be8a0516a	Fix project pull	2020-09-03 18:54:03 +02:00
Ines Montani	23b7d9cfa3	Prefix span getters	2020-09-03 17:37:06 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Ines Montani	c063e55eb7	Add prefix to batchers	2020-09-03 17:30:41 +02:00
Ines Montani	896caf45e3	Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip]	2020-09-03 17:13:44 +02:00
Ines Montani	c53b1433b9	Adjust more arguments [ci skip]	2020-09-03 17:12:24 +02:00
Ines Montani	b5a0657fd6	"model" terminology consistency in docs	2020-09-03 13:13:03 +02:00
Matthew Honnibal	f038841798	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-03 12:52:39 +02:00
Matthew Honnibal	ef0d0630a4	Let Language.use_params work with falsey inputs The Language.use_params method was failing if you passed in None, which meant we had to use awkward conditionals for the parameter averaging. This solves the problem.	2020-09-03 12:51:04 +02:00
Yohei Tamura	5af432e0f2	fix for empty string (#5936)	2020-09-03 10:09:03 +02:00
Adriane Boyd	77ac4a38aa	Simplify specials and cache checks (#6012)	2020-09-03 09:42:49 +02:00
Adriane Boyd	8b5594df86	Remove near-duplicate test	2020-09-02 20:32:01 +02:00

7886 Commits
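
Commit 7e4cd7575c in the log above says that `Doc.has_annotation(attr)` returns `True` for partial annotation, while `require_complete=True` requires every token to be annotated. The sketch below models only that distinction in plain Python. It is not spaCy code: the standalone `has_annotation` function, the use of `None` for an unset value, and the handling of an empty sequence are assumptions made for illustration.

```python
from typing import Optional, Sequence


def has_annotation(
    values: Sequence[Optional[str]], require_complete: bool = False
) -> bool:
    """Toy model of the partial vs. complete check described in 7e4cd7575c.

    `values` holds one entry per token for a single attribute;
    None stands for "unset" (an assumption of this sketch).
    """
    if not values:
        # Assumption of this sketch: nothing to annotate counts as annotated.
        return True
    if require_complete:
        # Complete annotation: every token has a value.
        return all(v is not None for v in values)
    # Partial annotation: at least one token has a value.
    return any(v is not None for v in values)


tags = ["NN", None, "VB"]  # the middle token has no tag
partial = has_annotation(tags)
complete = has_annotation(tags, require_complete=True)
print(partial, complete)
```

Here the partially tagged sequence passes the default check but fails the `require_complete=True` check, which is the behaviour the commit message describes for the real method.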