spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-16 00:52:38 +03:00

Author	SHA1	Message	Date
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	0406200a1e	Update docs [ci skip]	2020-09-18 15:13:13 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Matthew Honnibal	bbdb5f62b7	Temporary work-around for scoring a subset of components (#6090) * Try hacking the scorer to work around sentence boundaries * Upd scorer * Set dev version * Upd scorer hack * Fix version * Improve comment on hack	2020-09-18 14:26:42 +02:00
Ines Montani	d32ce121be	Fix docs [ci skip]	2020-09-18 13:41:12 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
svlandeg	e4fc7e0222	fixing output sample to proper 2D array	2020-09-17 22:34:36 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
Ines Montani	9062585a13	Merge pull request #6087 from explosion/docs/pretrain-usage [ci skip]	2020-09-17 19:25:24 +02:00
Ines Montani	a0b4389a38	Update docs [ci skip]	2020-09-17 19:24:48 +02:00
Matthew Honnibal	6efb7688a6	Draft pretrain usage	2020-09-17 18:17:03 +02:00
Sofie Van Landeghem	ed0fb034cb	ml_datasets v0.2.0a0	2020-09-17 18:11:10 +02:00
Ines Montani	1bb8b4f824	Merge branch 'master' into develop	2020-09-17 17:46:20 +02:00
Ines Montani	6bd0d25fb9	Merge pull request #6085 from explosion/docs/static-vectors-intro [ci skip]	2020-09-17 17:14:45 +02:00
Ines Montani	a2c8cda26f	Update docs [ci skip]	2020-09-17 17:12:51 +02:00
Ines Montani	2c80f41852	Merge pull request #6084 from svlandeg/feature/init-config-pretrain [ci skip]	2020-09-17 16:59:14 +02:00
Ines Montani	2e3ce9f42f	Merge branch 'feature/init-config-pretrain' of https://github.com/svlandeg/spaCy into pr/6084	2020-09-17 16:58:49 +02:00
Ines Montani	3d8e010655	Change order	2020-09-17 16:58:46 +02:00
Ines Montani	c4b414b282	Update website/docs/api/cli.md	2020-09-17 16:58:09 +02:00
Ines Montani	3865214343	Use consistent shortcut	2020-09-17 16:57:02 +02:00
Sofie Van Landeghem	e5ceec5df0	Update website/docs/api/cli.md Co-authored-by: Ines Montani <ines@ines.io>	2020-09-17 16:56:20 +02:00
Sofie Van Landeghem	127ce0c574	Update website/docs/api/cli.md Co-authored-by: Ines Montani <ines@ines.io>	2020-09-17 16:55:53 +02:00
Matthew Honnibal	ec751068f3	Draft text for static vectors intro	2020-09-17 16:42:53 +02:00
svlandeg	35a3931064	fix typo	2020-09-17 16:36:27 +02:00
svlandeg	5fade4feb7	fix cli abbrev	2020-09-17 16:15:20 +02:00
svlandeg	ddfc1fc146	add pretraining option to init config	2020-09-17 16:05:40 +02:00
svlandeg	3a3110ef60	remove empty files	2020-09-17 15:44:11 +02:00
svlandeg	c8c84f1ccd	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 15:43:04 +02:00
svlandeg	130ffa5fbf	fix typos in docs	2020-09-17 14:59:41 +02:00
Matthew Honnibal	b57ce9a875	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-17 13:59:25 +02:00
Matthew Honnibal	30e85b2a42	Remove outdated configs	2020-09-17 13:59:12 +02:00
Ines Montani	c8fa2247e3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-17 12:34:15 +02:00
Ines Montani	6761028c6f	Update docs [ci skip]	2020-09-17 12:34:11 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	8cedb2f380	Merge branch 'fix/corpus' of https://github.com/svlandeg/spaCy into fix/corpus	2020-09-17 09:27:55 +02:00
svlandeg	781fae678b	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
Sofie Van Landeghem	21dcf92964	Update website/docs/api/data-formats.md Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 09:21:36 +02:00
Matthew Honnibal	8303d101a5	Set version to v3.0.0a19	2020-09-17 00:18:49 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_from_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjusts tokens within a specified range rather than the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
Matthew Honnibal	c776594ab1	Fix	2020-09-16 18:15:14 +02:00
Matthew Honnibal	4a573d18b3	Add comment	2020-09-16 17:51:29 +02:00
Matthew Honnibal	d31afc8334	Fix Language.link_components when model is None	2020-09-16 17:49:48 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	d722a439aa	Remove unneeded methods in senter and morphologizer (#6074) Now that the tagger doesn't manage the tag map, the child classes senter and morphologizer don't need to override the serialization methods.	2020-09-16 17:39:41 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
svlandeg	0dc914b667	bump thinc to 8.0.0a33	2020-09-16 16:42:58 +02:00
svlandeg	1040e250d8	actual commit with test for custom readers with ml_datasets >= 0.2	2020-09-16 16:41:28 +02:00
svlandeg	714a5a05c6	test for custom readers with ml_datasets >= 0.2	2020-09-16 16:39:55 +02:00

13314 Commits