spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-14 18:22:27 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Ines Montani	6958510bda	Include spaCy version check in project CLI	2020-10-05 13:53:07 +02:00
Ines Montani	20f2a17a09	Merge test_misc and test_util	2020-10-05 13:45:57 +02:00
Ines Montani	1c641e41c3	Remove unused import [ci skip]	2020-10-05 11:50:11 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	549758f67d	Adjust test for now	2020-10-04 23:16:09 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Ines Montani	496228771d	Merge pull request #6194 from explosion/master-tmp	2020-10-04 15:25:41 +02:00
Ines Montani	0307a228c8	Merge pull request #6193 from explosion/fix/adjust-pipe-init Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 15:20:54 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Ines Montani	11347f34da	Tidy up, tests and docs	2020-10-04 13:54:05 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	2110e8f86d	Auto-format	2020-10-04 10:06:49 +02:00
Matthew Honnibal	835070cedc	Upd test	2020-10-03 19:35:10 +02:00
Ines Montani	c2401fca41	Add tests for Pipe.label_data	2020-10-03 19:12:46 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190)	2020-10-03 17:07:38 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	d2aa662ab2	Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]	2020-10-02 12:10:27 +02:00
Ines Montani	c41a4332e4	Add test for custom data augmentation	2020-10-02 11:37:56 +02:00
Adriane Boyd	f83dfe62da	Fix test	2020-10-02 10:17:26 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Ines Montani	d48ddd6c9a	Remove default initialize lookups	2020-10-01 21:54:33 +02:00
Adriane Boyd	73538782a0	Switch Doc.__init__(ents=) to IOB tags (#6173) * Switch Doc.__init__(ents=) to IOB tags * Fix check for "-" * Allow "" or None as missing IOB tag	2020-10-01 16:22:18 +02:00
Yohei Tamura	3243ddac8f	Fix/span.sent (#6083) * add fail test * fix test * fix span.sent * Remove incorrect implicit check Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-01 14:01:52 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	34f9c26c62	Add lexeme norm defaults	2020-09-30 10:20:14 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	6467a560e3	WIP: Test updating Chinese tokenizer	2020-09-29 21:10:22 +02:00
Ines Montani	78021089f9	Merge pull request #6160 from explosion/feature/prepare	2020-09-29 20:55:13 +02:00
Ines Montani	c3f8c09d7d	Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle	2020-09-29 20:54:59 +02:00
Ines Montani	d3c63b7965	Merge branch 'develop' into feature/prepare	2020-09-29 20:53:05 +02:00
Ines Montani	2be80379ec	Fix small issues, resolve_dot_names and debug model	2020-09-29 20:38:35 +02:00
Ines Montani	7851020653	Update tests	2020-09-29 18:14:15 +02:00
Ines Montani	f2352eb701	Test with default value	2020-09-29 17:00:40 +02:00
Ines Montani	63d1598137	Simplify config use in Language.initialize	2020-09-29 16:05:48 +02:00
Ines Montani	56f8bc73ef	Add more tests	2020-09-29 15:23:34 +02:00
Ines Montani	591038b1a4	Add test	2020-09-29 12:54:52 +02:00
Matthew Honnibal	e1fdf2b7c5	Upd tests	2020-09-29 12:05:38 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	2e9c9e74af	Fix config resolution and interpolation TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)	2020-09-28 15:34:00 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Matthew Honnibal	a976da168c	Support data augmentation in Corpus (#6155) * Support data augmentation in Corpus * Note initial docs for data augmentation * Add augmenter to quickstart * Fix flake8 * Format * Fix test * Update spacy/tests/training/test_training.py * Improve data augmentation arguments * Update templates * Move randomization out into caller * Refactor * Update spacy/training/augment.py * Update spacy/tests/training/test_training.py * Fix augment * Fix test	2020-09-28 03:03:27 +02:00
Ines Montani	9016d23cc5	Fix exclude and add test	2020-09-27 23:34:03 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Adriane Boyd	8393dbedad	Minor fixes * Put `cfg` back in serialization * Add `pickle5` to pytest conf	2020-09-27 15:15:53 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Adriane Boyd	3c062b3911	Add MORPH handling to Matcher (#6107) * Add MORPH handling to Matcher * Add `MORPH` to `Matcher` schema * Rename `_SetMemberPredicate` to `_SetPredicate` * Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate` * Add special handling for normalization and conversion of morph values into sets * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only matches for 0 or 1 values * Update test * Rename to IS_SUBSET and IS_SUPERSET	2020-09-24 16:55:09 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	d0ef4a4cf5	Prevent division by zero in score weights	2020-09-24 16:42:13 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Adriane Boyd	8eaacaae97	Refactor Doc.ents setter to use Doc.set_ents Additional changes: * Entity spans with missing labels are ignored * Fix ent_kb_id setting in `Doc.set_ents`	2020-09-24 12:36:51 +02:00
Ines Montani	c6c67b606e	Merge pull request #6133 from explosion/fix/score_weights	2020-09-24 12:00:57 +02:00
Ines Montani	4bbe41f017	Fix combined scores and update test	2020-09-24 10:42:47 +02:00
Sofie Van Landeghem	c645c4e7ce	fix micro PRF for textcat (#6130) * fix micro PRF for textcat * small fix	2020-09-24 10:31:17 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
svlandeg	b816ace4bb	format	2020-09-23 17:33:13 +02:00
svlandeg	5a9fdbc8ad	state_type as Literal	2020-09-23 17:32:14 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
Sofie Van Landeghem	86a08f819d	tok2vec.update instead of predict (#6113)	2020-09-22 21:54:52 +02:00
Adriane Boyd	e4acb28658	Fix norm in retokenizer split (#6111) Parallel to behavior in merge, reset norm on original token in retokenizer split.	2020-09-22 21:53:33 +02:00
Sofie Van Landeghem	e0e793be4d	fix KB IO (#6118)	2020-09-22 21:53:06 +02:00
Sofie Van Landeghem	d53c84b6d6	avoid None callback (#6100)	2020-09-22 13:54:44 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	e1b8090b9b	few more fixes	2020-09-22 12:01:06 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	beb766d0a0	Add test	2020-09-22 09:15:57 +02:00
Ines Montani	69f7e52c26	Update README.md	2020-09-22 09:10:06 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Ines Montani	a5f6ab4943	Merge pull request #6098 from adrianeboyd/feature/doc-init	2020-09-21 18:35:20 +02:00
Adriane Boyd	f212303729	Add sent_starts to Doc.__init__ Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start` values but also convert to and accept `sent_start` internally.	2020-09-21 17:59:09 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
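
Commit 3c062b3911 above describes set-based `IS_SUBSET` and `IS_SUPERSET` operators for `MORPH` values in the `Matcher`. The following is a minimal pure-Python sketch of those set semantics only, not spaCy's implementation; the helper names `parse_morph`, `is_subset` and `is_superset` are hypothetical, and the `Feat=Val|Feat=Val` string format is assumed for the morph values.

```python
from typing import FrozenSet, Iterable


def parse_morph(morph: str) -> FrozenSet[str]:
    """Split a 'Feat=Val|Feat=Val' string into a set of features (hypothetical helper)."""
    return frozenset(feat for feat in morph.split("|") if feat)


def is_subset(token_morph: str, pattern_values: Iterable[str]) -> bool:
    """True if every feature on the token appears in the pattern's list."""
    return parse_morph(token_morph) <= frozenset(pattern_values)


def is_superset(token_morph: str, pattern_values: Iterable[str]) -> bool:
    """True if the token carries every feature in the pattern's list."""
    return parse_morph(token_morph) >= frozenset(pattern_values)


token = "Number=Sing|Person=3"
print(is_subset(token, ["Number=Sing", "Person=3", "Tense=Pres"]))  # token features all allowed
print(is_superset(token, ["Number=Sing"]))                          # token has the required feature
print(is_superset(token, ["Number=Plur"]))                          # required feature is missing

# A single-valued attribute behaves like a one-element set, which is why the
# commit notes that the subset test acts like IN and the superset test can
# only match pattern lists with 0 or 1 values.
print(is_subset("NOUN", ["NOUN", "PROPN"]))    # same result as a membership test
print(is_superset("NOUN", ["NOUN", "PROPN"]))  # two required values cannot both be present
```

Treating the pattern value as a set keeps both operators to a single comparison and makes the degenerate behaviour for single-valued attributes fall out of ordinary set logic.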

2048 Commits