spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-29 11:26:28 +03:00

Author	SHA1	Message	Date
Ines Montani	9b86312bab	Update docs [ci skip]	2020-08-29 18:43:19 +02:00
Adriane Boyd	870774f475	Merge branch 'develop' into docs/morph-usage-v3	2020-08-29 16:00:50 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Adriane Boyd	f9ed31a757	Update usage docs for lemmatization and morphology	2020-08-29 15:56:50 +02:00
Ines Montani	bc0730be3f	Update docs [ci skip]	2020-08-29 12:53:14 +02:00
Ines Montani	450bf806b0	Merge pull request #5991 from adrianeboyd/docs/sent-usage-v3 Update sentence segmentation usage docs	2020-08-29 12:40:06 +02:00
Ines Montani	66d76f5126	Update docs	2020-08-29 12:36:05 +02:00
svlandeg	9f00a20ce4	proofreading and custom examples	2020-08-28 21:50:42 +02:00
svlandeg	5230529de2	add loggers registry & logger docs sections	2020-08-28 21:44:04 +02:00
Adriane Boyd	48df50533d	Update sentence segmentation usage docs Update sentence segmentation usage docs to incorporate `senter`.	2020-08-28 10:58:16 +02:00
svlandeg	72a87095d9	add loggers registry	2020-08-27 20:26:28 +02:00
svlandeg	aa9e0c9c39	small fix	2020-08-27 19:56:52 +02:00
svlandeg	8cde6ccb7d	Merge remote-tracking branch 'upstream/develop' into feature/vectors-docs	2020-08-27 19:56:09 +02:00
svlandeg	556e975a30	various fixes	2020-08-27 19:24:44 +02:00
Ines Montani	ff4175e839	Add more info to debug config	2020-08-27 18:17:58 +02:00
svlandeg	329e490560	small import fixes	2020-08-27 14:50:43 +02:00
svlandeg	28e4ba7270	fix references to TransformerListener	2020-08-27 14:33:28 +02:00
svlandeg	4d37ac3f33	configure_custom_sent_spans example	2020-08-27 14:14:16 +02:00
svlandeg	c68169f83f	fix link	2020-08-27 10:19:43 +02:00
svlandeg	acc794c975	example of writing to other custom attribute	2020-08-27 10:10:10 +02:00
svlandeg	559b65f2e0	adjust references to null_annotation_setter to trfdata_setter	2020-08-27 09:43:32 +02:00
Ines Montani	696f167478	Add diff example to docs [ci skip]	2020-08-26 15:57:54 +02:00
Adriane Boyd	90d88729e0	Add AttributeRuler.score (#5963) * Add AttributeRuler.score Add scoring for TAG / POS / MORPH / LEMMA if these are present in the assigned token attributes. Add default score weights (that don't really make a lot of sense) so that the scores are in the default config in some form. * Update docs	2020-08-26 15:39:30 +02:00
svlandeg	ec069627fe	rename to TransformerListener	2020-08-26 13:31:01 +02:00
Ines Montani	627617a079	Tidy up and add docs [ci skip]	2020-08-26 13:24:55 +02:00
svlandeg	15902c5aa2	fix link	2020-08-26 11:51:57 +02:00
svlandeg	feb86d5206	clarify default	2020-08-26 11:21:30 +02:00
Ines Montani	f31c4462ca	Update docs [ci skip]	2020-08-25 13:27:59 +02:00
Ines Montani	8ac5ef1284	Update docs	2020-08-25 11:54:37 +02:00
Matthew Honnibal	8038b87f04	Various small tweaks to project CLI (#5965) * Fix up/download of http and local paths * Support git_sparse_checkout for assets * Fix scorer * Handle already-present directories for git assets * Improve convert command * Fix support for existant files in git assets * Support branches in git sparse checkout * Format * Fix git assets * Document git block in assets * Fix test * Fix test * Revert "Fix test" This reverts commit `cf3097260f`. * Revert "Fix test" This reverts commit `964d636e27`. * Dont multiply p/r/f by 100 * Display scores * 100 during training	2020-08-25 00:30:52 +02:00
Ines Montani	26405710e0	Add icon credit [ci skip]	2020-08-24 10:28:15 +02:00
Matthew Honnibal	e559867605	Allow spacy project to push and pull to/from remote storage (#5949) * Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Updarte util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2020-08-23 18:32:09 +02:00
Ines Montani	f27aecac14	Update formatting [ci skip]	2020-08-23 11:57:56 +02:00
Ines Montani	98a9e063b6	Update docs [ci skip]	2020-08-22 17:15:05 +02:00
Matthew Honnibal	8dfc4cbfe7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-22 17:12:09 +02:00
Matthew Honnibal	048de64d4c	Suggest edits	2020-08-22 17:11:28 +02:00
Ines Montani	adcf790b96	Update docs[ci skip]	2020-08-22 17:04:16 +02:00
Ines Montani	37ebff6997	Update docs [ci skip]	2020-08-22 16:47:03 +02:00
Matthew Honnibal	8685229891	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-22 16:06:59 +02:00
Matthew Honnibal	d97695d09d	Update embeddings-transformers.md	2020-08-22 15:41:35 +02:00
Ines Montani	c7c9b0451f	Update docs [ci skip]	2020-08-22 13:52:52 +02:00
Ines Montani	71aeae89c5	Merge pull request #5948 from svlandeg/feature/docs-docs-docs [ci skip]	2020-08-22 12:18:47 +02:00
Ines Montani	27f81109d6	Update docs [ci skip]	2020-08-21 20:02:18 +02:00
Ines Montani	f102164a1f	Update docs [ci skip]	2020-08-21 19:34:06 +02:00
svlandeg	1b7cfa7347	Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs	2020-08-21 18:36:18 +02:00
svlandeg	942adf0f4d	comma	2020-08-21 18:36:02 +02:00
svlandeg	262552010d	context manager with space (for consistency)	2020-08-21 18:34:02 +02:00
svlandeg	da48c6a2a2	several small updates	2020-08-21 18:25:26 +02:00
svlandeg	ad2332d4b7	alphabetize registries	2020-08-21 18:10:31 +02:00
svlandeg	dc98f69b57	alphabetize registries	2020-08-21 18:10:21 +02:00
svlandeg	c6659e37d8	small fixes	2020-08-21 18:02:20 +02:00
svlandeg	518a1f97f3	remove outdated TODO's	2020-08-21 17:55:15 +02:00
svlandeg	e92bd6e1c1	alphabetize training lists	2020-08-21 17:42:19 +02:00
Ines Montani	2cc4640385	Update docs [ci skip]	2020-08-21 16:21:55 +02:00
Ines Montani	74cb6d39d0	Update docs [ci skip]	2020-08-21 16:11:38 +02:00
Matthew Honnibal	f5bcc10268	Update architectures	2020-08-21 15:34:54 +02:00
Matthew Honnibal	7ed8f4504b	Update API docs for architectures	2020-08-21 15:22:19 +02:00
Ines Montani	aa6a7cd6e7	Update docs and consistency [ci skip]	2020-08-21 13:49:18 +02:00
Ines Montani	52bd3a8b48	Update docs [ci skip]	2020-08-21 13:22:59 +02:00
Ines Montani	e60442d83a	Adjust label casing in displaCy NER visualizer (resolves #4866) - Accept any case for label names in ents and colors option, even if actual predicted label uses different casing - Don't text-transform: uppercase visually, if it's important to users that the label is represented as-is in the UI	2020-08-21 11:51:31 +02:00
Ines Montani	04e4d59235	Update docs [ci skip]	2020-08-20 16:17:25 +02:00
Ines Montani	7f2e4244df	Merge pull request #5941 from svlandeg/feature/update-more-docs	2020-08-20 11:21:24 +02:00
Ines Montani	6ad59d59fe	Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip]	2020-08-20 11:20:58 +02:00
Ines Montani	fb51b55eb9	Add comment [ci skip]	2020-08-20 11:20:43 +02:00
Sofie Van Landeghem	410b54e10e	Update website/docs/api/data-formats.md Co-authored-by: Ines Montani <ines@ines.io>	2020-08-20 11:15:34 +02:00
svlandeg	ae719b354f	fix typos	2020-08-20 10:20:40 +02:00
svlandeg	f728c00cbb	Merge remote-tracking branch 'upstream/develop' into feature/update-more-docs # Conflicts: # website/docs/api/data-formats.md	2020-08-20 10:02:13 +02:00
svlandeg	229033831a	add explanation of raw_text	2020-08-20 10:00:45 +02:00
Ines Montani	2253d26b82	Update vectors and similarity docs [ci skip]	2020-08-19 21:18:26 +02:00
Ines Montani	ea6640ea72	Merge pull request #5939 from explosion/feature/thinc-v8.0.0a28 Update Thinc and config variables	2020-08-19 21:14:36 +02:00
Ines Montani	15e6feed01	Update docs [ci skip]	2020-08-19 20:37:54 +02:00
svlandeg	09f3cfc985	add version	2020-08-19 19:58:45 +02:00
svlandeg	7d9f00bdbf	waltzing schedule	2020-08-19 19:53:00 +02:00
Ines Montani	3dd390b1a1	Update Thinc and config variables	2020-08-19 19:46:12 +02:00
svlandeg	85b39639e1	small fix	2020-08-19 19:17:36 +02:00
svlandeg	d8f6abdc23	add linking TODO back in	2020-08-19 18:00:35 +02:00
svlandeg	169b5bcda0	Merge remote-tracking branch 'upstream/develop' into feature/update-docs # Conflicts: # website/docs/usage/training.md	2020-08-19 17:58:25 +02:00
svlandeg	7119295a8a	badgers intro	2020-08-19 17:53:22 +02:00
svlandeg	4906a2ae6c	custom functions intro	2020-08-19 17:32:35 +02:00
svlandeg	7a2e6a96f5	fix typo	2020-08-19 16:54:16 +02:00
svlandeg	648499157a	rename "custom models" to "custom functions"	2020-08-19 16:53:51 +02:00
Ines Montani	63921161c8	Update docs [ci skip]	2020-08-19 16:04:21 +02:00
svlandeg	d3a8321172	fix typos	2020-08-19 15:12:12 +02:00
svlandeg	60fedb8518	fix 2 more API lines	2020-08-19 14:55:32 +02:00
svlandeg	2dfd919585	add kb_loader and get_candidates back to EL API	2020-08-19 14:52:49 +02:00
Ines Montani	225f8866a1	Fix consistency	2020-08-19 12:47:57 +02:00
Ines Montani	9c25656ccc	Update docs [ci skip]	2020-08-19 12:14:41 +02:00
Ines Montani	2285e59765	Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip]	2020-08-19 11:29:02 +02:00
Ines Montani	13291e97ba	Update docs [ci skip]	2020-08-19 00:28:37 +02:00
svlandeg	6ed67d495a	format	2020-08-18 19:43:20 +02:00
svlandeg	f9fe5eb323	clean up example	2020-08-18 19:35:23 +02:00
svlandeg	a8acedd4ba	example of custom reader and batcher	2020-08-18 19:15:16 +02:00
svlandeg	0d55b6ebb4	formatting	2020-08-18 18:55:56 +02:00
svlandeg	abba639565	Merge remote-tracking branch 'upstream/develop' into feature/more-v3-docs	2020-08-18 18:55:12 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Ines Montani	82f0e20318	Update docs and consistency [ci skip]	2020-08-18 14:39:40 +02:00
Matthew Honnibal	b72bd1767f	Remove todo	2020-08-18 13:52:22 +02:00
Matthew Honnibal	574fd53289	Add precision/recall description	2020-08-18 13:51:08 +02:00
Matthew Honnibal	96a9c65f97	Add model architectures intro	2020-08-18 13:50:55 +02:00
svlandeg	705e1cb06c	typo in link	2020-08-18 12:04:05 +02:00
svlandeg	f7b76d2d83	Merge remote-tracking branch 'upstream/develop' into feature/more-v3-docs	2020-08-18 11:57:52 +02:00
svlandeg	8dcda351ec	typo's and quick note on default values	2020-08-18 10:23:27 +02:00
Ines Montani	ef6cf3b276	Update docs [ci skip]	2020-08-18 01:29:34 +02:00
Ines Montani	1c3bcfb488	Update docs and util consistency	2020-08-18 01:22:59 +02:00
Ines Montani	728fec0194	Update docs [ci skip]	2020-08-18 00:49:19 +02:00
Ines Montani	9299166c75	Merge pull request #5925 from explosion/docs/vectors [ci skip] Update the 'vectors' docs page	2020-08-17 21:45:09 +02:00
Ines Montani	990c6b4c32	Update docs and CLI [ci skip]	2020-08-17 21:38:20 +02:00
svlandeg	4fe4bab1c9	typo fixes	2020-08-17 17:10:15 +02:00
svlandeg	da80c18660	merge develop into branch	2020-08-17 16:57:18 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Matthew Honnibal	052d82aa4e	Suggest vectors changes	2020-08-17 15:32:30 +02:00
svlandeg	961e818be6	p/r definitions	2020-08-17 15:02:39 +02:00
svlandeg	319692aa53	fix typos	2020-08-17 14:05:48 +02:00
Matthew Honnibal	61dfdd9fbd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-16 20:30:01 +02:00
Matthew Honnibal	be07567ac6	Update transformers page	2020-08-16 20:29:50 +02:00
Matthew Honnibal	8e5f99ee25	Update transformer docs intro. Also write system requirements	2020-08-16 20:13:24 +02:00
Ines Montani	2ac4b0ef3e	Finish Transformer docs [ci skip]	2020-08-16 15:56:32 +02:00
Ines Montani	6ae83bde0c	Fix CLI consistency [ci skip]	2020-08-16 15:46:29 +02:00
Ines Montani	a570c304df	Update quickstart, template and docs	2020-08-15 14:50:29 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	b7ec06e331	Update docs [ci skip]	2020-08-11 20:57:23 +02:00
Ines Montani	10f42e3a39	Update docs [ci skip]	2020-08-11 00:09:49 +02:00
Ines Montani	2778d04377	Update docs [ci skip]	2020-08-10 23:41:09 +02:00
Ines Montani	adf2b1c8a9	Update graphic [ci skip]	2020-08-10 17:20:04 +02:00
Ines Montani	023ba7ae26	Update docs	2020-08-10 17:13:11 +02:00
Ines Montani	c099f6eece	Add Token.lex	2020-08-10 16:43:52 +02:00
Ines Montani	64f2f84098	Update docstrings and docs [ci skip]	2020-08-10 13:45:22 +02:00
Ines Montani	12052bd8f6	Update docs [ci skip]	2020-08-10 01:20:10 +02:00
Ines Montani	0832cdd443	Fix formatting [ci skip]	2020-08-10 00:46:32 +02:00
Ines Montani	d611cbef43	Update docs [ci skip]	2020-08-10 00:42:26 +02:00
Ines Montani	c044460823	Update docs [ci skip]	2020-08-10 00:01:38 +02:00
Ines Montani	05dcab10aa	Fix typo	2020-08-09 22:34:03 +02:00
Ines Montani	d5c78c7a34	Update docs and fix consistency	2020-08-09 22:31:52 +02:00
Ines Montani	a15c5fb191	Update docstrings and docs	2020-08-09 16:10:48 +02:00
Ines Montani	8d2baa153d	Update tokenizer docs and add test	2020-08-09 15:24:01 +02:00
Ines Montani	46bc513a4e	Update docs [ci skip]	2020-08-07 20:14:31 +02:00
Ines Montani	fe29ceec9e	Merge branch 'develop' into docs/model-docstrings	2020-08-07 18:42:01 +02:00
Ines Montani	470b6f8073	Update docs	2020-08-07 18:41:15 +02:00
Ines Montani	3901b088ff	Update graphics and 101 [ci skip]	2020-08-07 17:14:13 +02:00
Ines Montani	5e1421e5a6	Update docs [ci skip]	2020-08-07 16:23:12 +02:00
Ines Montani	b7e34c1451	Update docs [ci skip]	2020-08-07 16:13:13 +02:00
Ines Montani	6f3649923c	Merge pull request #5893 from explosion/feature/validate-arg	2020-08-07 15:47:20 +02:00
Ines Montani	e829d3bf14	Update docs [ci skip]	2020-08-07 15:46:20 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Adriane Boyd	4aecccf153	Update API docs for AttributeRuler.__init__	2020-08-07 15:17:25 +02:00
Ines Montani	a8404c3517	validation -> validate	2020-08-07 14:43:47 +02:00
Ines Montani	1d01d89b79	Update CLI docs and evaluate command [ci skip]	2020-08-07 14:40:58 +02:00
Ines Montani	ef2c67cca5	Add DocBin to/from_disk methods and update docs (#5892) * Add DocBin to/from_disk methods and update docs * Use DocBin.from_disk in Corpus	2020-08-07 14:30:59 +02:00
Ines Montani	4ca08c6d5d	Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api Add AttributeRuler API docs	2020-08-07 13:55:12 +02:00
Adriane Boyd	b8d0c23857	Add AttributeRuler API docs With additional minor updates to AttributeRuler docstrings.	2020-08-07 12:43:23 +02:00
svlandeg	824f4b2107	casing consistent	2020-08-06 23:20:13 +02:00
svlandeg	b17db0e994	Merge remote-tracking branch 'upstream/develop' into feature/el-docs # Conflicts: # website/docs/usage/training.md	2020-08-06 19:48:52 +02:00
svlandeg	49ddeb99ea	add textcat architectures documentation	2020-08-06 19:44:47 +02:00
Ines Montani	e5995904d6	Update docs	2020-08-06 19:30:43 +02:00
svlandeg	e8fd0c1f1e	EL architectures documentation	2020-08-06 17:41:26 +02:00
svlandeg	f396f091dc	update EL API	2020-08-06 16:40:48 +02:00
svlandeg	81d0b1c390	update EL pipe arguments	2020-08-06 16:22:50 +02:00
svlandeg	0b4d1e1bc4	'debug data' instead of 'debug-data'	2020-08-06 15:47:31 +02:00
svlandeg	881e3f8fd0	add docbin explanation and example	2020-08-06 15:29:44 +02:00
Ines Montani	5d417d3b19	WIP: Update docs [ci skip]	2020-08-06 13:10:15 +02:00
Ines Montani	06e80d95cd	Sync develop with nightly docs state (#5883) Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-08-06 00:28:14 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880)	2020-08-05 23:35:09 +02:00
Ines Montani	50311a4d37	Update docs [ci skip]	2020-08-05 20:29:53 +02:00
Ines Montani	2a4d56e730	Update docs	2020-08-05 15:01:00 +02:00
Ines Montani	cdec46493f	Update docs	2020-08-05 15:00:54 +02:00
Adriane Boyd	c62fd878a3	Allow Doc.char_span to snap to token boundaries (#5849) * Allow Doc.char_span to snap to token boundaries Add a `mode` option to allow `Doc.char_span` to snap to token boundaries. The `mode` options: * `strict`: character offsets must match token boundaries (default, same as before) * `inside`: all tokens completely within the character span * `outside`: all tokens at least partially covered by the character span Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching. * Remove unused import * Rename mode to alignment_mode Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.	2020-08-04 13:36:32 +02:00
Ines Montani	4c055f0aa7	Add init CLI and init config (#5854) * Add init CLI and init config draft * Improve config validation * Auto-format * Don't export anything in debug config * Update docs	2020-08-02 15:18:30 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	98c6a85c8b	Update docs [ci skip]	2020-07-31 18:55:38 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
Ines Montani	6365837ca9	Merge pull request #5833 from explosion/feature/scorer-adjustments	2020-07-31 14:00:39 +02:00
Ines Montani	5a221f79c2	Revert "Remove keyword-only from Scorer API docs" [ci skip] This reverts commit `7a6ac47dc1`.	2020-07-31 14:00:21 +02:00
Ines Montani	160f1a5f94	Update docs [ci skip]	2020-07-31 13:26:39 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Adriane Boyd	9d79916792	Merge branch 'develop' into feature/scorer-adjustments	2020-07-31 10:48:14 +02:00
Ines Montani	3449c45fd9	Update docs [ci skip]	2020-07-29 19:48:26 +02:00
Ines Montani	9c80cb673d	Update docs [ci skip]	2020-07-29 19:41:34 +02:00
Ines Montani	9f69afdd1e	Update docs [ci skip]	2020-07-29 19:09:44 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Ines Montani	6a5c853edb	Fix docs [ci skip]	2020-07-29 18:45:12 +02:00
Ines Montani	158d8c1e48	Update docs [ci skip]	2020-07-29 18:44:10 +02:00
Matthew Honnibal	f7adc9d3b7	Start rewriting vectors docs	2020-07-29 17:10:06 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Ines Montani	e0ffe36e79	Update docstrings, docs and types	2020-07-29 11:36:42 +02:00
Adriane Boyd	7a6ac47dc1	Remove keyword-only from Scorer API docs	2020-07-29 10:40:30 +02:00
Ines Montani	ac24adec73	Small adjustments to Scorer and docs	2020-07-28 21:39:42 +02:00
Ines Montani	256b24b720	Update arch docs WIP [ci skip]	2020-07-28 20:33:52 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
Ines Montani	d8b519c23c	API docs, docstrings and argument consistency	2020-07-27 18:11:45 +02:00
Ines Montani	10b84e1e27	Add flag to toggle sdist creation on package [ci skip]	2020-07-27 16:52:23 +02:00
Adriane Boyd	fdf09cb231	Update Scorer API docs for score_cats	2020-07-27 15:34:42 +02:00
Ines Montani	7dd53d0964	Fix typo [ci skip]	2020-07-27 00:34:00 +02:00
Ines Montani	7adbaf9a5b	Update docs [ci skip]	2020-07-27 00:29:45 +02:00
Matthew Honnibal	fb5dbe30b5	Trim training 101	2020-07-26 13:43:22 +02:00
Matthew Honnibal	e6a7deb7cc	Edits to the training 101 section	2020-07-26 13:42:08 +02:00
Ines Montani	c288dba8e7	Update docs [ci skip]	2020-07-25 18:51:12 +02:00
Ines Montani	eb9acae34d	Merge pull request #5791 from adrianeboyd/docs/morphology	2020-07-25 15:10:21 +02:00
Li Zhe	a69eb445dc	fix the wrong hash url in adding-languages.md file (#5810) * fix the wrong hash url in adding-languages.md file change the #101 url hash path to #language-data * filled in the spaCy Contributor Agreement filled in the spaCy Contributor Agreement	2020-07-25 13:13:38 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. * Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Adriane Boyd	8f44584bef	Update MorphAnalysis.get and related examples	2020-07-23 08:51:31 +02:00
Adriane Boyd	941b9e33f7	Add Token.morph_	2020-07-22 17:59:45 +02:00
Ines Montani	be476e495e	Merge pull request #5787 from adrianeboyd/docs/morphologizer Initial draft of Morphologizer API docs	2020-07-22 17:16:57 +02:00
Adriane Boyd	d3385f4be2	Add Morphology and MorphAnalysis to overview	2020-07-21 13:06:22 +02:00
Adriane Boyd	fcd3a4abe3	Add morph to Token API docs	2020-07-21 13:05:58 +02:00
Adriane Boyd	14df00ae98	Add Morphology and MorphAnalsysis API docs Add initial draft of `Morphology` and `MorphAnalysis` API docs.	2020-07-21 10:33:46 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Adriane Boyd	986f7e4d69	Initial draft of Morphologizer API docs	2020-07-20 12:53:02 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Ines Montani	872938ec76	Merge pull request #5747 from explosion/feature/refactor-config-args	2020-07-14 00:00:22 +02:00
Ines Montani	5f6f4ff594	Remove object subclassing	2020-07-12 14:03:23 +02:00
Ines Montani	c96535e338	Update command docstrings and docs	2020-07-12 13:53:49 +02:00
Ines Montani	3f948b9c74	Update docs	2020-07-12 12:32:28 +02:00
Ines Montani	11bbc82c24	Update cli.md [ci skip]	2020-07-10 23:37:52 +02:00
Ines Montani	9455b060d2	Update cli.md	2020-07-10 22:57:22 +02:00
Ines Montani	7b5717cac3	Merge branch 'develop' into feature/refactor-config-args	2020-07-10 22:50:07 +02:00
Ines Montani	e6a6587a9a	Update projects.md [ci skip]	2020-07-10 22:41:27 +02:00
Ines Montani	f2cd982e7b	Update training.md	2020-07-10 22:34:27 +02:00
Ines Montani	52e9b5b472	Fix formatting	2020-07-09 23:25:58 +02:00
Ines Montani	28cdae898a	Update projects.md	2020-07-09 22:35:54 +02:00
Ines Montani	7bcf9f7cfb	Document new features	2020-07-09 21:10:36 +02:00
Ines Montani	ea01831f6a	Update projects docs etc.	2020-07-09 19:43:25 +02:00
Ines Montani	175d34d8f9	Update sidebar menu	2020-07-09 11:44:09 +02:00
Ines Montani	9ee5b71412	Update cli.md	2020-07-09 11:44:00 +02:00
Ines Montani	9ae4040183	Update API docs	2020-07-08 13:34:35 +02:00
svlandeg	c94279ac1b	remove tensors, fix predict, get_loss and set_annotations	2020-07-08 13:11:54 +02:00
svlandeg	90b100c39f	remove component.Model, update constructor, losses is return value of update	2020-07-08 12:14:30 +02:00
Ines Montani	2298e129e6	Update example and training docs	2020-07-07 20:30:12 +02:00
svlandeg	2b60e894cb	fix component constructors, update, begin_training, reference to GoldParse	2020-07-07 19:17:19 +02:00
svlandeg	14a796e3f9	add Example API with examples of Example usage	2020-07-07 14:46:41 +02:00
Ines Montani	bb3ee38cf9	Update WIP	2020-07-06 22:22:37 +02:00
Ines Montani	44da24ddd0	Update doc.md	2020-07-06 18:17:00 +02:00
Ines Montani	44790c1c32	Update docs and add keyword-only tag	2020-07-06 18:14:57 +02:00
Ines Montani	a35236e5f0	Update v3 docs WIP [ci skip]	2020-07-06 15:57:44 +02:00
Ines Montani	63247cbe87	Update v3 docs [ci skip]	2020-07-05 16:11:16 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Ines Montani	dc8c9d912f	Update docs [ci skip]	2020-07-04 16:47:24 +02:00
Ines Montani	4498dfe99d	Update docs	2020-07-04 16:25:30 +02:00
Ines Montani	1e0d54edd1	Update docs	2020-07-04 14:23:10 +02:00
Ines Montani	fe224dc2dd	Merge branch 'develop' into nightly.spacy.io	2020-07-03 16:48:27 +02:00
Ines Montani	06f1ecb308	Update v3 docs	2020-07-03 16:48:21 +02:00
Ines Montani	cdf9ee1716	Add stub for Example API docs [ci skip]	2020-07-03 15:46:10 +02:00
Ines Montani	fa8e097c04	Update convert docs [ci skip]	2020-07-03 15:42:04 +02:00
Jan Jessewitsch	e4dcac4a4b	Merging multiple docs into one (#5032) * Add static method to Doc to allow merging of multiple docs. * Add error description for the error that occurs if docs with different vocabs (from different languages) are merged in Doc.from_docs(). * Add test for Doc.from_docs() implementation. * Fix using numpy's concatenate in Doc.from_docs. * Replace typing's type annotations in from_docs. * Simply remove type annotations in from_docs. * Add documentation for Doc.from_docs to api. * Simplify from_docs, its test and the api doc for codebase consistency. * Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes. * Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages. * Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test. * Add MORPH to attrs * Update warnings calls * Remove out-dated error from merge * Rename space_delimiter to ensure_whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-03 11:32:42 +02:00
Adriane Boyd	a723fa02a1	DocBin: add version number, missing attributes and strings (#5685) * Add version number to DocBin Add a version number to DocBin for future use. * Add POS to all attributes in DocBin * Add morph string to strings in DocBin * Update DocBin API * Add string for ENT_KB_ID in DocBin	2020-07-02 17:41:50 +02:00
Ines Montani	b5268955d7	Update matcher usage examples [ci skip]	2020-07-02 15:39:45 +02:00
Ines Montani	a4cfe9fc33	Remove inline notes on v2 changes [ci skip]	2020-07-01 22:29:22 +02:00
Ines Montani	fe4cfd0632	Start updating website for v3 [ci skip]	2020-07-01 21:26:39 +02:00
Ines Montani	26df4efa94	Add new in v3.0	2020-07-01 13:02:17 +02:00
Ines Montani	18a900abc2	Fix markup	2020-07-01 13:02:07 +02:00
Ines Montani	414dc7ace1	Merge branch 'spacy.io' into spacy.io-develop	2020-07-01 11:47:47 +02:00
Álvaro Abella Bascarán	7111b9de2e	Fix in docs: pipe(docs) instead of pipe(texts) (#5680) Very minor fix in docs, specifically in this part: ``` matcher = PhraseMatcher(nlp.vocab) > for doc in matcher.pipe(texts, batch_size=50): > pass ``` `texts` suggests the input is an iterable of strings. I replaced it for `docs`.	2020-06-30 20:01:12 +02:00
Álvaro Abella Bascarán	ff0dbe5c64	Fix in docs: pipe(docs) instead of pipe(texts) (#5680) Very minor fix in docs, specifically in this part: ``` matcher = PhraseMatcher(nlp.vocab) > for doc in matcher.pipe(texts, batch_size=50): > pass ``` `texts` suggests the input is an iterable of strings. I replaced it for `docs`.	2020-06-30 20:00:50 +02:00
Matthias Hertel	305221f3e5	Website: fixed the token span in the text about the rule-based matching example (#5669) * fixed token span in pattern matcher example * contributor agreement	2020-06-30 19:58:55 +02:00
Matthias Hertel	8b0f749606	Website: fixed the token span in the text about the rule-based matching example (#5669) * fixed token span in pattern matcher example * contributor agreement	2020-06-30 19:58:23 +02:00
Adriane Boyd	d777d9cc38	Extend v2.3 migration guide (#5653) * Extend preloaded vocab section * Add section on tag maps	2020-06-26 14:13:01 +02:00
Adriane Boyd	c4d0209472	Extend v2.3 migration guide (#5653) * Extend preloaded vocab section * Add section on tag maps	2020-06-26 14:12:29 +02:00
Adriane Boyd	a2660bd9c6	Fix backslashes in warnings config diff (#5640) Fix backslashes in warnings config diff in v2.3 migration section.	2020-06-24 10:26:57 +02:00
Adriane Boyd	fd4287c178	Fix backslashes in warnings config diff (#5640) Fix backslashes in warnings config diff in v2.3 migration section.	2020-06-24 10:26:12 +02:00
Adriane Boyd	4f73ced914	Extend what's new in v2.3 with vocab / is_oov (#5635)	2020-06-23 16:50:43 +02:00
Adriane Boyd	7ce451c211	Extend what's new in v2.3 with vocab / is_oov (#5635)	2020-06-23 16:48:59 +02:00
Adriane Boyd	fcdecefacf	Add warnings example in v2.3 migration guide (#5627)	2020-06-22 14:38:06 +02:00
Adriane Boyd	bc1cb30b21	Add warnings example in v2.3 migration guide (#5627)	2020-06-22 14:37:24 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Adriane Boyd	66889de166	Warning for sudachipy 0.4.5 (#5611)	2020-06-19 13:45:23 +02:00
Adriane Boyd	931d80de72	Warning for sudachipy 0.4.5 (#5611)	2020-06-19 12:43:41 +02:00
Ines Montani	6d712f3e06	Merge pull request #5599 from adrianeboyd/docs/v2.3.0-minor	2020-06-16 13:49:25 -07:00
Adriane Boyd	02369f91d3	Fix spacy convert argument	2020-06-16 20:41:17 +02:00
Adriane Boyd	f0fd77648f	Change example title to Dr. Change example title to Dr. so the current model does exclude the title in the initial example.	2020-06-16 20:36:21 +02:00
Adriane Boyd	a6abdfbc3c	Fix numpy.zeros() dtype for Doc.from_array	2020-06-16 20:35:45 +02:00
Adriane Boyd	9aff317ca7	Update POS in tagging example	2020-06-16 20:26:57 +02:00
Adriane Boyd	457babfa0c	Update alignment example for new gold.align	2020-06-16 20:22:03 +02:00
Ines Montani	44af53bdd9	Add pkuseg warnings and auto-format [ci skip]	2020-06-16 17:13:35 +02:00
Ines Montani	a9e5b840ee	Fix typos and auto-format [ci skip]	2020-06-16 16:38:45 +02:00
Adriane Boyd	d5110ffbf2	Documentation updates for v2.3.0 (#5593) * Update website models for v2.3.0 * Add docs for Chinese word segmentation * Tighten up Chinese docs section * Merge branch 'master' into docs/v2.3.0 [ci skip] * Merge branch 'master' into docs/v2.3.0 [ci skip] * Auto-format and update version * Update matcher.md * Update languages and sorting * Typo in landing page * Infobox about token_match behavior * Add meta and basic docs for Japanese * POS -> TAG in models table * Add info about lookups for normalization * Updates to API docs for v2.3 * Update adding norm exceptions for adding languages * Add --omit-extra-lookups to CLI API docs * Add initial draft of "What's New in v2.3" * Add new in v2.3 tags to Chinese and Japanese sections * Add tokenizer to migration section * Add new in v2.3 flags to init-model * Typo * More what's new in v2.3 Co-authored-by: Ines Montani <ines@ines.io>	2020-06-16 15:37:35 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
Sofie Van Landeghem	4d1ba6feb4	add tag variant for 2.3 (#5542)	2020-06-04 19:16:33 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
svlandeg	5f0a91cf37	fix conv-depth parameter	2020-05-29 09:56:29 +02:00
Ines Montani	262d306eaa	unicode -> str consistency	2020-05-24 17:23:00 +02:00
Ines Montani	5d3806e059	unicode -> str consistency	2020-05-24 17:20:58 +02:00
Jannis	aa53ce6996	Documentation Typo Fix (#5492) * Fix typo Change 'realize' to 'realise' * Add contributer agreement	2020-05-22 19:50:26 +02:00
Matthew Honnibal	f6078d866a	Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match Revert token_match priority changes from #4374 and extend token match options	2020-05-22 14:42:51 +02:00
Ines Montani	65c7e82de2	Auto-format and remove 2.3 feature [ci skip]	2020-05-22 13:50:30 +02:00
Adriane Boyd	e4a1b5dab1	Rename to url_match Rename to `url_match` and update docs.	2020-05-22 12:41:03 +02:00
Adriane Boyd	730fa493a4	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-22 12:18:00 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Ines Montani	f333c2a011	Merge pull request #5386 from svlandeg/fix/nel-docs	2020-05-10 12:00:09 +02:00
adrianeboyd	4a15b559ba	Clarify Token.pos as UPOS (#5419)	2020-05-08 10:36:25 +02:00
adrianeboyd	a2345618f1	Fix Token API docs from #5375 (#5418)	2020-05-08 10:25:02 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
svlandeg	ebaed7dcfa	Few more updates to the EL documentation	2020-04-30 10:17:06 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
adrianeboyd	90ce34db42	Add cuda101 and cuda102 options to setup (#5377) * Add cuda101 and cuda102 options to setup * Update cudaNNN options in docs	2020-04-29 12:51:12 +02:00
adrianeboyd	792aa7b6ab	Remove references to textcat spans (#5360) Remove references to unimplemented `TextCategorizer` span labels in `GoldParse` and `Doc`.	2020-04-27 18:01:12 +02:00
adrianeboyd	90c754024f	Update nlp.vectors to nlp.vocab.vectors (#5357)	2020-04-27 10:53:05 +02:00
Mike	481574cbc8	[minor doc change] embedding vis. link is broken in `website/docs/usage/examples.md` (#5325) * The embedding vis. link is broken The first link seems to be reasonable for now unless someone has an updated embedding vis they want to share? * contributor agreement * Update Mlawrence95.md * Update website/docs/usage/examples.md Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-21 20:35:12 +02:00
laszabine	fb73d4943a	Amend documentation to Language.evaluate (#5319) * Specified usage of arguments to Language.evaluate * Created contributor agreement	2020-04-16 20:00:18 +02:00
Sofie Van Landeghem	a3965ec13d	tag-map-path since 2.2.4 instead of 2.2.3 (#5289)	2020-04-14 14:53:47 +02:00
Marek Grzenkowicz	6a8a52650f	[Closes #5292] Fix typo in option name "--n-save_every" (#5293) * Sign contributor agreement for chopeen * Fix typo in option name and close #5292	2020-04-11 23:35:01 +02:00
Sofie Van Landeghem	1137420840	Small doc fixes (#5250) * fix link * torchtext instead tochtext	2020-04-03 13:01:43 +02:00
Nikhil Saldanha	d1ddfa1cb7	update docs for EntityRecognizer.predict return type was wrongly written as a tuple, changed to syntax.StateClass	2020-03-28 18:13:02 +01:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
Tiljander	e53232533b	Describing priority rules for overlapping matches (#5197) * Describing priority rules for overlapping matches * Create Tiljander.md * Describing priority rules for overlapping matches * Update website/docs/api/entityruler.md Co-Authored-By: Ines Montani <ines@ines.io> Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 13:13:22 +01:00
adrianeboyd	d88a377bed	Remove Vectors.from_glove (#5209)	2020-03-26 10:45:47 +01:00
Ines Montani	17bd9ed84f	Merge pull request #5153 from pinealan/fix/website-docs Fix website typos and weird sentences	2020-03-16 15:03:01 +01:00
Alan Chan	2124be100d	Tweak run-on sentence	2020-03-15 03:45:20 +08:00
Alan Chan	7c3a4ce933	Missing word in api/cli doc	2020-03-15 03:45:20 +08:00
Alan Chan	36e3532475	Remove unfinished sentence	2020-03-15 03:45:17 +08:00
Mark Abraham	a0ffa346c0	Fix broken link in docs	2020-03-13 14:07:26 +01:00
Ines Montani	c669435c62	Merge pull request #5125 from renaud/patch-1 small typo in code sample	2020-03-12 11:19:12 +01:00
svlandeg	1724a4f75b	additional information if doc is empty	2020-03-09 18:08:18 +01:00
Renaud Richardet	eccf6b1686	small typo in code sample	2020-03-09 14:49:11 +01:00
Adriane Boyd	0c31f03ec5	Update docs [ci skip]	2020-03-09 13:41:17 +01:00
Adriane Boyd	1139247532	Revert changes to token_match priority from #4374 * Revert changes to priority of `token_match` so that it has priority over all other tokenizer patterns * Add lookahead and potentially slow lookbehind back to the default URL pattern * Expand character classes in URL pattern to improve matching around lookaheads and lookbehinds related to #4882 * Revert changes to Hungarian tokenizer * Revert (xfail) several URL tests to their status before #4374 * Update `tokenizer.explain()` and docs accordingly	2020-03-09 12:09:41 +01:00
Ines Montani	1d6aec805d	Fix formatting and update docs for v2.2.4	2020-03-09 11:17:20 +01:00
Ines Montani	acb4e3c7ba	Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape Fix formatting in Token API	2020-02-25 14:57:25 +01:00
Sofie Van Landeghem	479bd8d09f	add lemma option to displacy 'dep' visualiser (#5041) * add lemma option to displacy 'dep' visualiser * more compact list comprehension * add option to doc * fix test and add lemmas to util.get_doc * fix capital * remove lemma from get_doc * cleanup	2020-02-22 14:11:51 +01:00
Adriane Boyd	3853d385fa	Fix formatting in Token API	2020-02-20 13:41:24 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Kabir Khan	f6ed07b85c	Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931) * Fix ent_ids and labels properties when id attribute used in patterns * use set for labels * sort end_ids for comparison in entity_ruler tests * fixing entity_ruler ent_ids test * add to set * Run make_doc optimistically if using phrase matcher patterns. * remove unused coveragerc I was testing with * format * Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially. * Removing old add_patterns function * Fixing spacing * Make sure token_patterns loaded as well, before generator was being emptied in from_disk	2020-02-16 18:17:47 +01:00
Julin S	479e81bafc	fix link (#4977)	2020-02-10 20:31:26 -05:00
Ines Montani	9c08d9baa3	Remove old sections [ci skip] (closes #4961)	2020-02-03 13:10:46 +01:00
Ines Montani	abd5c06374	Adjust formatting [ci skip]	2020-02-03 13:00:02 +01:00
Martin A. Kayser	02a44c5be2	Adding a note on retrieving the string rep of the match_id (#4904) Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types	2020-02-03 12:58:58 +01:00
adrianeboyd	7ad000fce7	Update docs for train CLI --use_gpu option (#4927)	2020-01-20 17:02:47 +01:00
Preston Badeer	b216ff43c9	Update vectors-similarity.md (#4889) These links are broken on the website, due to quotes around the URLs.	2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook	53929138d7	remove extra word typo (#4875) "let you find you"	2020-01-06 12:37:42 +01:00
Ines Montani	400257a802	Update index.md [ci skip]	2020-01-04 01:52:18 +01:00
Ivan Echevarria	ef13e0c038	Add n_process to Language.pipe documentation (#4842) [ci skip] * Add n_process to documentation * Auto-format and add default [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-12-29 14:23:33 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	158b98a3ef	Merge branch 'master' into develop	2019-12-21 18:55:03 +01:00
Ines Montani	1b838d1313	Divide models into core and starters [ci skip]	2019-12-21 14:10:22 +01:00
Sofie Van Landeghem	8ebbb85117	Documentation for PhraseMatcher constructor (#4826) * add max_length as argument for init PhraseMatcher * improve error message too	2019-12-20 23:00:04 +01:00
Thiago Lages de Alencar	a067ded495	Update doc.md (#4796)	2019-12-11 18:21:40 +01:00
Tclack88	ab8dc2732c	Update token.md (#4767) * Update token.md documentation is confusing: A '?' is a right punct, but '¿' is a left punct * Update token.md add quotations around parentheses in `is_left_punct` and `is_right_punct` for clarrification, ensuring the question mark that follows is not percieved as an example of left and right punctuation * Move quotes into code block [ci skip]	2019-12-06 19:22:02 +01:00
Ines Montani	bf611ebca7	Document jsonl option on converter [ci skip]	2019-12-06 19:17:45 +01:00
Nicolai Bjerre Pedersen	de5453cdcb	Fix link to user hooks in docs (#4778) * Fix link to user hooks in docs * Update mr_bjerre.md Mistake in contributor agreement * Apparently hard to get it right (wrong name of sca)	2019-12-06 19:17:12 +01:00
Ines Montani	cbacb0f1a4	Update shape docs and examples (resolves #4615) [ci skip]	2019-11-23 17:16:55 +01:00
Ines Montani	a6200bc424	Update scorer.md [ci skip]	2019-11-21 17:02:43 +01:00
Ines Montani	235fe6fe3b	Auto-format [ci skip]	2019-11-20 13:14:58 +01:00
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. * Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Ines Montani	e8b9cee6fd	Make example consistent with model (closes #4587) [ci skip]	2019-11-18 12:41:48 +01:00
Ines Montani	e01a1a237f	Auto-format [ci skip]	2019-11-18 12:41:31 +01:00
adrianeboyd	62e00fd9da	Update tokenization usage docs (#4666) Update pseudo-code and algorithm description to correspond to current tokenizer behavior. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications.	2019-11-18 12:35:13 +01:00
Ines Montani	5adcb352e9	Adjust order of docs sections [ci skip]	2019-11-17 16:08:56 +01:00
Ines Montani	e30d08410a	Add CI for Python 3.8 (#4479) * Add 3.8 classifier * Update azure-pipelines.yml * Remove 3.8 warning from docs [ci skip]	2019-11-15 01:13:48 +01:00
adrianeboyd	faaa832518	Generalize handling of tokenizer special cases (#4259) * Generalize handling of tokenizer special cases Handle tokenizer special cases more generally by using the Matcher internally to match special cases after the affix/token_match tokenization is complete. Instead of only matching special cases while processing balanced or nearly balanced prefixes and suffixes, this recognizes special cases in a wider range of contexts: * Allows arbitrary numbers of prefixes/affixes around special cases * Allows special cases separated by infixes Existing tests/settings that couldn't be preserved as before: * The emoticon '")' is no longer a supported special case * The emoticon ':)' in "example:)" is a false positive again When merged with #4258 (or the relevant cache bugfix), the affix and token_match properties should be modified to flush and reload all special cases to use the updated internal tokenization with the Matcher. * Remove accidentally added test case * Really remove accidentally added test * Reload special cases when necessary Reload special cases when affixes or token_match are modified. Skip reloading during initialization. * Update error code number * Fix offset and whitespace in Matcher special cases * Fix offset bugs when merging and splitting tokens * Set final whitespace on final token in inserted special case * Improve cache flushing in tokenizer * Separate cache and specials memory (temporarily) * Flush cache when adding special cases * Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()` are necessary due to this bug: https://github.com/explosion/preshed/issues/21 * Remove reinitialized PreshMaps on cache flush * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Use special Matcher only for cases with affixes * Reinsert specials cache checks during normal tokenization for special cases as much as possible * Additionally include specials cache checks while splitting on infixes * Since the special Matcher needs consistent affix-only tokenization for the special cases themselves, introduce the argument `with_special_cases` in order to do tokenization with or without specials cache checks * After normal tokenization, postprocess with special cases Matcher for special cases containing affixes * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add test for #4248, clean up test * Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits * Update error message number * Remove UD script modifications Only used for timing/testing, should be a separate PR * Remove final traces of UD script modifications * Update UD bin scripts * Update imports for `bin/` * Add all currently supported languages * Update subtok merger for new Matcher validation * Modify blinded check to look at tokens instead of lemmas (for corpora with tokens but not lemmas like Telugu) * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Switch to PhraseMatcher.find_matches * Switch to local cdef functions for span filtering * Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold * Move more of special case retokenize to cdef nogil Move as much of the special case retokenization to nogil as possible. * Rewrap sort as stdsort for OS X * Rewrap stdsort with specific types * Switch to qsort * Fix merge * Improve cmp functions * Fix realloc * Fix realloc again * Initialize span struct while retokenizing * Temporarily skip retokenizing * Revert "Move more of special case retokenize to cdef nogil" This reverts commit `0b7e52c797`. * Revert "Switch to qsort" This reverts commit `a98d71a942`. * Fix specials check while caching * Modify URL test with emoticons The multiple suffix tests result in the emoticon `:>`, which is now retokenized into one token as a special case after the suffixes are split off. * Refactor _apply_special_cases() * Use cdef ints for span info used in multiple spots * Modify _filter_special_spans() to prefer earlier Parallel to #4414, modify _filter_special_spans() so that the earlier span is preferred for overlapping spans of the same length. * Replace MatchStruct with Entity Replace MatchStruct with Entity since the existing Entity struct is nearly identical. * Replace Entity with more general SpanC * Replace MatchStruct with SpanC * Add error in debug-data if no dev docs are available (see #4575) * Update azure-pipelines.yml * Revert "Update azure-pipelines.yml" This reverts commit `ed1060cf59`. * Use latest wasabi * Reorganise install_requires * add dframcy to universe.json (#4580) * Update universe.json [ci skip] * Fix multiprocessing for as_tuples=True (#4582) * Fix conllu script (#4579) * force extensions to avoid clash between example scripts * fix arg order and default file encoding * add example config for conllu script * newline * move extension definitions to main function * few more encodings fixes * Add load_from_docbin example [ci skip] TODO: upload the file somewhere * Update README.md * Add warnings about 3.8 (resolves #4593) [ci skip] * Fixed typo: Added space between "recognize" and "various" (#4600) * Fix DocBin.merge() example (#4599) * Replace function registries with catalogue (#4584) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip] * Bugfix/dep matcher issue 4590 (#4601) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMacther (#4590) * Minor updates to language example sentences (#4608) * Add punctuation to Spanish example sentences * Combine multilanguage examples for lang xx * Add punctuation to nb examples * Always realloc to a larger size Avoid potential (unlikely) edge case and cymem error seen in #4604. * Add error in debug-data if no dev docs are available (see #4575) * Update debug-data for GoldCorpus / Example * Ignore None label in misaligned NER data	2019-11-13 21:24:35 +01:00
f11r	877971860e	Fix assert in sentencizer documentation. (#4639)	2019-11-13 15:24:14 +01:00
Ines Montani	9d5ff177c4	Work around Markdown rendering issue surfaced in #4600 [ci skip]	2019-11-11 17:12:08 +01:00
adrianeboyd	0f8678c0b1	Fix DocBin.merge() example (#4599)	2019-11-07 11:26:48 +01:00
walterhenry	5563c42ef5	Fixed typo: Added space between "recognize" and "various" (#4600)	2019-11-06 23:06:36 +01:00
Ines Montani	828ef27a32	Add warnings about 3.8 (resolves #4593) [ci skip]	2019-11-05 18:30:11 +01:00
Ines Montani	59358d9b71	Remove box-decoration-break from entities in displacy (#4564)	2019-10-31 15:09:43 +01:00
Ines Montani	4e1de85e43	Update syntax iterators [ci skip]	2019-10-30 14:31:40 +01:00
Matthew Honnibal	d5509e0989	Support Mish activation (requires Thinc 7.3) (#4536) * Add arch for MishWindowEncoder * Support mish in tok2vec and conv window >=2 * Pass new tok2vec settings from parser * Syntax error * Fix tok2vec setting * Fix registration of MishWindowEncoder * Fix receptive field setting * Fix mish arch * Pass more options from parser * Support more tok2vec options in pretrain * Require thinc 7.3 * Add docs [ci skip] * Require thinc 7.3.0.dev0 to run CI * Run black * Fix typo * Update Thinc version Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 15:16:33 +01:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	493be8e9db	Update new version identifier [ci skip]	2019-10-25 11:42:49 +02:00
Ines Montani	2abf1028cb	Update docs [ci skip]	2019-10-25 11:27:00 +02:00
Ines Montani	f31876154d	Adjust formatting [ci skip]	2019-10-25 11:19:46 +02:00
Kabir Khan	93640373c7	Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513) * Update entityruler.py * Making ent_id resolution 2x faster and adding docs * Fixing newlines in docstrings * Fixing newlines in docstrings	2019-10-25 11:16:42 +02:00
adrianeboyd	1b0bbe4b76	Update tag maps and docs for English and German (#4501) * Update English tag_map Update English tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/en-penn-uposf.html * Update German tag_map Update German tag_map based on this conversion table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html * Add missing Tiger dependencies to glossary * Add quotes to definition of TO * Update POS/TAG tables in docs Update POS/TAG tables for English and German docs using current information generated from the tag_maps and GLOSSARY. * Update warning that -PRON- is specific to English * Revert docs to default JSON output with convert * Revert "Revert docs to default JSON output with convert" This reverts commit `6b78c048f1`.	2019-10-24 12:56:05 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
adrianeboyd	7fc39f124c	Fix logic in rules+model entity example [ci skip] (#4510)	2019-10-23 14:41:21 +02:00
Ines Montani	4659435573	Fix argument type in PhraseMatcher.add docs (closes #4496) [ci skip]	2019-10-22 14:37:30 +02:00
Ines Montani	b2f88e2060	Fix formatting [ci skip]	2019-10-21 12:26:07 +02:00
adrianeboyd	3195a8f170	Add Entity Linking to menu (#4489)	2019-10-21 12:17:30 +02:00
Pepe Berba	7772d5d3c5	Update `vocab.get_vector` docs to include features on Fasttext ngram (#4464) * Update `vocab.get_vector` * Added contrib agreement	2019-10-20 01:28:18 +02:00
Ghola	258eb9e064	Misspelling on Lemmatizer Example #4406 (#4449) Removing extra o in the lookups = Loookups()	2019-10-16 23:23:15 +02:00
Anastassia	4a77d03ff7	Fix documentation for the docs_to_json function (#4456)	2019-10-16 23:17:58 +02:00
Ines Montani	573e543e4a	Alphanumeric -> alphabetic [ci skip] see ines/spacy-course#38	2019-10-06 13:30:01 +02:00
Ines Montani	e65dffd80b	Clarify serialization of extension attributes (closes #4377) [ci skip]	2019-10-05 11:58:00 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ines Montani	ce1d441de5	Add docs for Vectors.most_similar [ci skip]	2019-10-03 14:29:47 +02:00
Ines Montani	80cf385f65	Update v2-2.md [ci skip]	2019-10-02 16:58:21 +02:00
Ines Montani	b6670bf0c2	Use consistent spelling	2019-10-02 10:37:39 +02:00
Ines Montani	475e3188ce	Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip]	2019-10-01 21:59:50 +02:00
Ines Montani	0dd127bb00	Update v2-2.md [ci skip]	2019-10-01 21:37:06 +02:00
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	bc7e7db208	Fix wording [ci skip]	2019-10-01 14:20:44 +02:00
Ines Montani	2a3a4565cd	Update infobox [ci skip]	2019-10-01 14:19:34 +02:00
Ines Montani	66aa0d479f	Update v2.2 page [ci skip]	2019-10-01 14:11:05 +02:00
Ines Montani	a8a1800f2a	Update lemma data documentation [ci skip]	2019-10-01 13:22:13 +02:00
Ines Montani	932ad9cb91	Fix typos and formatting [ci skip]	2019-10-01 12:30:04 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	3bd4da068e	Fix link [ci skip]	2019-09-29 17:30:38 +02:00
Ines Montani	089f44cc56	Update serialization docs [ci skip]	2019-09-29 17:11:13 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	10742d3219	Update v2 docs [ci skip]	2019-09-28 15:57:22 +02:00
Ines Montani	f8d1e2f214	Update CLI docs [ci skip]	2019-09-28 13:12:30 +02:00
Ines Montani	59beab8405	Update v2-2.md [ci skip]	2019-09-27 18:10:43 +02:00
Ines Montani	685e4b2554	Update v2-2.md [ci skip]	2019-09-27 16:35:01 +02:00
Ines Montani	aad66d9bb9	Document PhraseMatcher.remove [ci skip]	2019-09-27 16:34:53 +02:00
Ines Montani	eb0649e38e	Fix tag [ci skip]	2019-09-26 16:22:33 +02:00
Ines Montani	da9a869d3f	Update vectors name docs [ci skip]	2019-09-26 16:21:32 +02:00
Em Zhan	aafa091541	Fix typo in documentation (#4322) * Fix typo 'probj' instead of 'pobj' * Add spaCy contributor agreement for zqianem	2019-09-25 19:42:18 +02:00
Matthew Honnibal	92ed4dc5e0	Allow vectors name to be set in init-model (#4321) * Allow vectors name to be specified in init-model * Document --vectors-name argument to init-model * Update website/docs/api/cli.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-09-25 13:11:00 +02:00
Ines Montani	197406de1d	Update v2-2.md [ci skip]	2019-09-19 14:33:58 +02:00
Ines Montani	ddc09b08ed	Update v2-2.md [ci skip]	2019-09-19 00:58:30 +02:00
Matthew Honnibal	e2047576c4	Fix merge conflict	2019-09-18 21:42:11 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Ines Montani	9c940eab94	Update version in examples [ci skip]	2019-09-18 21:23:26 +02:00
Ines Montani	f873548f6c	Add backwards incompatibility [ci skip]	2019-09-18 21:21:48 +02:00
Ines Montani	6ebdc5f7d2	Update download docs [ci skip]	2019-09-18 21:21:39 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	d62690b3ba	Update examples	2019-09-18 19:57:36 +02:00
Ines Montani	bd435faddd	Add note about usage docs [ci skip]	2019-09-18 19:56:43 +02:00
Matthew Honnibal	931e96b6c7	DocPallet->DocBin in docs	2019-09-18 15:17:26 +02:00
Matthew Honnibal	f537cbeacc	Update v2-2 docs	2019-09-18 14:07:55 +02:00
Ines Montani	ee15fdfe88	Fix wording [ci skip]	2019-09-17 14:59:42 +02:00
Ines Montani	f566e69f38	Fix --vectors-loc docs (closes #4270)	2019-09-17 14:59:12 +02:00
Ines Montani	25c2b4b9a5	Improve init-model docs (see #4137)	2019-09-17 14:51:44 +02:00
Ines Montani	198b7e9789	Auto-format [ci skip]	2019-09-17 14:48:35 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. 
* Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
Ines Montani	86befc80bf	WIP: Add v2.2 page [ci skip]	2019-09-14 16:41:48 +02:00
Ines Montani	04d36d2471	Remove unused link [ci skip]	2019-09-14 16:41:19 +02:00
Ines Montani	5c8b5e68ec	Fix docs consistency [ci skip]	2019-09-14 16:23:37 +02:00
Ines Montani	bbf7337eaf	Update adding languages docs [ci skip]	2019-09-14 15:32:15 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Ines Montani	ff51fba96a	Update lemmatizer docs [ci skip]	2019-09-12 16:26:33 +02:00
Ines Montani	25b2b3ff45	Remove LEMMA from exception examples [ci skip]	2019-09-12 16:26:27 +02:00
Ines Montani	82c16b7943	Remove u-strings and fix formatting [ci skip]	2019-09-12 16:11:15 +02:00
Ines Montani	a31e9e1cd5	Update training docs [ci skip]	2019-09-12 15:32:39 +02:00
Ines Montani	b544dcb3c5	Document debug-data [ci skip]	2019-09-12 15:26:20 +02:00
Ines Montani	c0a4cab178	Update "Adding languages" docs [ci skip]	2019-09-12 14:53:06 +02:00
Ines Montani	10257f3131	Document Lookups [ci skip]	2019-09-12 14:00:14 +02:00
Ines Montani	aa4ff0baa1	Auto-format [ci skip]	2019-09-12 13:05:53 +02:00
Ines Montani	625ce2db8e	Update Language docs [ci skip]	2019-09-12 13:03:38 +02:00
Ines Montani	cb41a33d14	Update displaCy API docs [ci skip]	2019-09-12 12:59:20 +02:00
Ines Montani	e7c20ad1d2	Update colors entry points docs [ci skip]	2019-09-12 12:59:10 +02:00
Ines Montani	7b59a919e6	Update entry points docs [ci skip]	2019-09-12 12:52:06 +02:00
Sofie Van Landeghem	0b4b4f1819	Documentation for Entity Linking (#4065 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * typo fix * add candidate API to kb documentation * update API sidebar with EntityLinker and KnowledgeBase * remove EL from 101 docs * remove entity linker from 101 pipelines / rephrase * custom el model instead of existing model * set version to 2.2 for EL functionality * update documentation for 2 CLI scripts	2019-09-12 11:38:34 +02:00
Sofie Van Landeghem	53a9ca45c9	Docs: bufsize instead of buffsize (#4247)	2019-09-06 11:11:54 +02:00
Sofie Van Landeghem	6b012cebff	Make pos/tag distinction more clear in docs (#4246) * make distinction between tag and pos more prominent in docs * out of the 101	2019-09-06 10:31:21 +02:00
adrianeboyd	82159b5c19	Updates/bugfixes for NER/IOB converters (#4186 ) * Updates/bugfixes for NER/IOB converters * Converter formats `ner` and `iob` use autodetect to choose a converter if possible * `iob2json` is reverted to handle sentence-per-line data like `word1\|pos1\|ent1 word2\|pos2\|ent2` * Fix bug in `merge_sentences()` so the second sentence in each batch isn't skipped * `conll_ner2json` is made more general so it can handle more formats with whitespace-separated columns * Supports all formats where the first column is the token and the final column is the IOB tag; if present, the second column is the POS tag * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O` separates documents * Add option for segmenting sentences (new flag `-s`) * Parser-based sentence segmentation with a provided model, otherwise with sentencizer (new option `-b` to specify model) * Can group sentences into documents with `n_sents` as long as sentence segmentation is available * Only applies automatic segmentation when there are no existing delimiters in the data * Provide info about settings applied during conversion with warnings and suggestions if settings conflict or might not be not optimal. * Add tests for common formats * Add '(default)' back to docs for -c auto * Add document count back to output * Revert changes to converter output message * Use explicit tabs in convert CLI test data * Adjust/add messages for n_sents=1 default * Add sample NER data to training examples * Update README * Add links in docs to example NER data * Define msg within converters	2019-08-29 12:04:01 +02:00
Björn Böing	bae0455f91	Fix visualizer options linking for displaCy. (#4202)	2019-08-27 14:04:28 +02:00
Christos Aridas	61f5c007a0	DOC Fix pipeline functions examples (#4189)	2019-08-23 19:15:32 +02:00
adrianeboyd	8fe7bdd0fa	Improve token pattern checking without validation (#4105 ) * Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-08-21 14:00:37 +02:00
Ines Montani	3134a9b6e0	Add section on expanding regex match to token boundaries (see #4158) [ci skip]	2019-08-21 12:53:31 +02:00
Ines Montani	fe230c8776	Fix typo [ci skip]	2019-08-20 13:02:05 +02:00
Daniel Bourke	b0a28fd0de	fix PhraseMatcher link typo (#4150) /api/phtasematcher -> /api/phrasematcher	2019-08-20 13:01:43 +02:00
Ines Montani	ce4c3e5204	Document force flag on set_extension (closes #4148)	2019-08-19 19:22:07 +02:00
Ines Montani	66aba2d676	Improve regex matching docs [ci skip]	2019-08-19 13:59:41 +02:00
Sofie Van Landeghem	cc66f47893	Make enabling/disabling jupyter mode more explicit (#4144) * make enabling/disabling jupyter mode more explicit * markup fix	2019-08-19 11:53:34 +02:00
Ines Montani	e520eb3f6c	Make visualized NER examples more clear (closes #4104) [ci skip]	2019-08-18 16:29:29 +02:00
Ines Montani	1362f793cf	Improve docs on phrase pattern attributes (closes #4100) [ci skip]	2019-08-11 11:13:49 +02:00
Ines Montani	8b4a0fabbb	Adjust docs example [ci skip]	2019-08-07 00:46:47 +02:00
adrianeboyd	69aca7d839	Add validate option to EntityRuler (#4089) * Add validate option to EntityRuler * Add validate to EntityRuler, passed to Matcher and PhraseMatcher * Add validate to usage and API docs * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io> * Update website/docs/usage/rule-based-matching.md Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-07 00:40:53 +02:00
Ines Montani	4ae320e5c2	Use consistent casing for entity ruler patterns (see #4063) [ci skip]	2019-08-06 12:20:22 +02:00
Ines Montani	223bde5cf6	Improve docs on matcher attributes [ci skip] (closes #4063)	2019-08-06 12:13:42 +02:00
Ines Montani	2bfae0b167	Auto-format	2019-08-06 12:13:31 +02:00
Ines Montani	0f76e0022d	Update .tensor docs [ci skip]	2019-08-01 18:37:09 +02:00
Björn Böing	a83c0add2e	Add links to tokenizer API docs to refer relevant information. (#4064) * Add links to tokenizer API docs to refer relevant information. * Add suggested changes Co-Authored-By: Ines Montani <ines@ines.io>	2019-08-01 14:28:38 +02:00
Ejar	2cdf7d39e7	Corrected imported function (#4062) The example showed an incorrect import	2019-08-01 12:43:36 +02:00
Ines Montani	fcd2f7f656	Fix version introducing Span.ents (closes #4045) [ci skip]	2019-07-30 10:32:33 +02:00
Ines Montani	fc69da0acb	💫 Support simple training format in nlp.evaluate and add tests (#4033) * Support simple training format in nlp.evaluate and add tests * Update docs [ci skip]	2019-07-27 17:30:18 +02:00
Ines Montani	bd39e5e630	Add "Processing text" section [ci skip]	2019-07-25 17:38:03 +02:00
Ines Montani	a5e3d2f318	Improve section on disabling pipes [ci skip]	2019-07-25 14:25:34 +02:00
Ines Montani	02e444ec7c	Add section on special tokenizer component [ci skip]	2019-07-25 14:25:03 +02:00
Ines Montani	1fa6d6ba55	Improve consistency of docs examples [ci skip]	2019-07-25 14:24:56 +02:00
adrianeboyd	784a5f4284	Update GoldParse attributes in API docs (#4023) * add `words` * update name of entity list to `ner` I think it might be a bit more consistent to have `ner` named `entities` or `ents` (and `ents` is actually set somewhere to `None`, which is a bit confusing), but it looks like renaming it would be a non-trivial decision.	2019-07-25 12:14:02 +02:00
Adriane Boyd	6c5044ed2a	Update annotation docs for German - minor formatting fixes - remove STTS tags not used in Tiger - update list of dependency relations to match tiger2dep	2019-07-22 11:59:03 +02:00
adrianeboyd	d2c474cbb7	Fix initial example in EntityRuler API docs (#3999)	2019-07-22 11:18:55 +02:00
Ines Montani	1167c303a0	Fix typos [ci skip]	2019-07-19 13:08:18 +02:00
BreakBB	6d9a7c0749	Add '--silent' argument to bash example of CLI Info	2019-07-19 10:00:45 +02:00
BreakBB	c8ba0f690d	Fix --force parameter of CLI package	2019-07-19 10:00:45 +02:00
Ines Montani	a0acb1b3cd	Also add infobox to API docs [ci skip]	2019-07-17 16:26:41 +02:00
Ines Montani	c3ead02ea5	Adjust wording [ci skip]	2019-07-17 16:06:25 +02:00
Ines Montani	1d5ff3e455	Add infobox	2019-07-17 15:29:36 +02:00
Ines Montani	114cb18892	Improve wording	2019-07-17 15:27:53 +02:00
Ines Montani	7522beef9e	Add "Things to try" prompts	2019-07-17 15:25:02 +02:00
Ines Montani	9f02e3c027	Adjust example Not actually supported in this alignment interpretation	2019-07-17 15:13:50 +02:00
Ines Montani	1ea472468a	Add usage docs for aligning tokenization	2019-07-17 15:08:33 +02:00
Ines Montani	f97a555445	Add API documentation	2019-07-17 14:30:04 +02:00
pmbaumgartner	9a86d95ea2	fix custom attribute links	2019-07-14 20:23:54 -04:00
Ines Montani	40cd03fc35	Improve EntityRuler serialization	2019-07-10 12:25:45 +02:00
Ines Montani	8721849423	Update Scorer.ents_per_type	2019-07-10 11:19:28 +02:00
Ines Montani	ebe58e7fa1	Document gold.docs_to_json [ci skip]	2019-07-10 10:27:33 +02:00
Ines Montani	881f5bc401	Auto-format	2019-07-10 10:27:29 +02:00
Björn Böing	205c73a589	Update tokenizer and doc init example (#3939) * Fix Doc.to_json hyperlink * Update tokenizer and doc init examples * Change "matchin rules" to "punctuation rules" * Auto-format	2019-07-10 10:16:48 +02:00
Björn Böing	04982ccc40	Update pretrain to prevent unintended overwriting of weight fil… (#3902) * Update pretrain to prevent unintended overwriting of weight files for #3859 * Add '--epoch-start' to pretrain docs * Add missing pretrain arguments to bash example * Update doc tag for v2.1.5	2019-07-09 21:48:30 +02:00
Joshua Smith	2eb925bd05	Added an argument to `EntityRuler` constructor to pass attrs to… (#3919 ) * Perserve flags in EntityRuler The EntityRuler (explosion/spaCy#3526) does not preserve overwrite flags (or `ent_id_sep`) when serialized. This commit adds support for serialization/deserialization preserving overwrite and ent_id_sep flags. * add signed contributor agreement * flake8 cleanup mostly blank line issues. * mark test from the issue as needing a model The test from the issue needs some language model for serialization but the test wasn't originally marked correctly. * Adds `phrase_matcher_attr` to allow args to PhraseMatcher This is an added arg to pass to the `PhraseMatcher`. For example, this allows creation of a case insensitive phrase matcher when the `EntityRuler` is created. References explosion/spaCy#3822 * remove unneeded model loading The model didn't need to be loaded, and I replaced it with a change that doesn't require it (using existings fixtures) * updated docstring for new argument * updated docs to reflect new argument to the EntityRuler constructor * change tempdir handling to be compatible with python 2.7 * return conflicted code to entityruler Some stuff got cut out because of merge conflicts, this returns that code for the phrase_matcher_attr. * fixed typo in the code added back after conflicts * flake8 compliance When I deconflicted the branch there were some flake8 issues introduced. This resolves the spacing problems. * test changes: attempts to fix flaky test in python3.5 These tests seem to be alittle flaky in 3.5 so I changed the check to avoid the comparisons that seem to be fail sometimes.	2019-07-09 20:09:17 +02:00
Ines Montani	d361e380b8	Fix matcher callback example (closes #3862)	2019-06-26 14:47:26 +02:00
Guillaume Claret	d7a519a922	Typo (#3865) * Typo * Add contributor agreement	2019-06-20 10:31:19 +02:00
Björn Böing	ebf5a04d6c	Update pretrain docs and add unsupported loss_func error (#3860) * Add error to `get_vectors_loss` for unsupported loss function of `pretrain` * Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs. * Add missing quotation marks	2019-06-20 10:30:44 +02:00
Alejandro Alcalde	4866a7ee9e	Changed learning rate by its param name. (#3855) * Changed learning rate by its param name. I've been searching for a while how the parameter learning rate was named, with `beta1` and `beta2` it's easy as they are marked as code, but learning rate wasn't. I think writing the actual parameter name would be helpful. * Signing SCA	2019-06-20 10:29:20 +02:00
Ines Montani	81c12640ab	Auto-format [ci skip]	2019-06-16 14:33:20 +02:00
Greg Werner	9041a72d7f	Update tokenizer.md for construction example (#3790) * Update tokenizer.md for construction example Self contained example. You should really say what nlp is so that the example will work as is * Update CONTRIBUTOR_AGREEMENT.md * Restore contributor agreement * Adjust construction examples	2019-06-16 14:32:56 +02:00
BreakBB	d8573ee715	Update error raising for CLI pretrain to fix #3840 (#3843) * Add check for empty input file to CLI pretrain * Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key * Skip empty values for correct pretrain keys and log a counter as warning * Add tests for CLI pretrain core function make_docs. * Add a short hint for the `tokens` key to the CLI pretrain docs * Add success message to CLI pretrain * Update model loading to fix the tests * Skip empty values and do not create docs out of it	2019-06-16 13:22:57 +02:00
Motoki Wu	9c064e6ad9	Add resume logic to spacy pretrain (#3652) * Added ability to resume training * Add to README * Remove duplicate entry	2019-06-12 13:29:23 +02:00
Ramanan Balakrishnan	eb12703d10	minor fix to broken link in documentation (#3819) [ci skip]	2019-06-04 11:15:35 +02:00
Ines Montani	0c74506c9c	Fix typos in docs (closes #3802) [ci skip]	2019-06-01 11:35:01 +02:00
Nipun Sadvilkar	1f13005751	Incorrect Token attribute ent_iob_ description (#3800) * Incorrect Token attribute ent_iob_ description * Add spaCy contributor agreement	2019-05-31 16:50:45 +02:00
Ramanan Balakrishnan	26c37c5a4d	fix all references to BILUO annotation format (#3797)	2019-05-31 12:19:19 +02:00
mak	89379a7fa4	Corrected example model URL in requirements.txt (#3786) The URL used to show how to add a model to the requirements.txt had the old release path (excl. explosion).	2019-05-29 10:51:55 +02:00
Ines Montani	7634812172	Document Language.evaluate	2019-05-24 14:06:36 +02:00
Ines Montani	45e6855550	Update Language.update docs	2019-05-24 14:06:26 +02:00
Ines Montani	b78a8dc1d2	Update Scorer and add API docs	2019-05-24 14:06:04 +02:00
Ines Montani	321c9f5acc	Fix lex_id docs (closes #3743)	2019-05-16 23:15:58 +02:00
Ines Montani	f96af8526a	Merge branch 'spacy.io' [ci skip]	2019-05-11 23:03:56 +02:00
Ines Montani	7534f7cb44	Fix return value of Language.update (closes #3692)	2019-05-11 18:40:19 +02:00
devforfu	21af12eb53	Make "text" key in JSONL format optional when "tokens" key is provided (#3721) * Fix issue with forcing text key when it is not required * Extending the docs to reflect the new behavior	2019-05-11 15:41:29 +02:00
Ines Montani	6cfa1e1f47	Fix DependencyParser.predict docs (resolves #3561)	2019-05-11 15:37:54 +02:00
Ines Montani	25f5592d57	Improve Token.prob and Lexeme.prob docs (resolves #3701)	2019-05-11 15:23:41 +02:00
Aaron Kub	719a15f23d	fixing regex matcher examples (#3708) (#3719)	2019-05-10 14:23:52 +02:00
Ines Montani	65b55f1aaa	Add version tag to `--base-model` argument (closes #3720)	2019-05-10 14:06:47 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686)	2019-05-08 02:33:40 +02:00
张晓飞	ba1ff00370	update response after calling add_pipe (#3661) * update response after calling add_pipe component:print_info is appended last, so it needs to be shown at the end of the pipeline * Create henry860916.md	2019-05-01 12:02:18 +02:00
Ramiro Gómez	8ee4100f8f	Remove dangling M (#3657) I assume this is a typo. Sorry if it has a meaning that I'm not aware of.	2019-04-29 19:44:43 +02:00
Amit Chaudhary	167d63af31	Fix broken link to Dive Into Python 3 website (#3656) * Fix broken link to Dive Into Python 3 website * Sign spaCy Contributor Agreement	2019-04-29 19:44:00 +02:00
Ivan Tham	fa94f83697	Improve redundant variable name (#3643) * Improve redundant variable name * Apply suggestions from code review Co-Authored-By: pickfire <pickfire@riseup.net>	2019-04-26 16:50:14 +02:00
Ines Montani	ec0d840ab5	Document early stopping	2019-04-22 14:31:32 +02:00
Ines Montani	1d567913f9	Update spacy evaluate example	2019-04-22 14:28:42 +02:00
Ines Montani	7917ce2f73	Make flag shortcut consistent and document	2019-04-22 14:23:44 +02:00
Ines Montani	52658c80d5	Allow jupyter=False to override Jupyter mode (closes #3598)	2019-04-22 14:18:32 +02:00
Motoki Wu	8e2cef49f3	Add save after `--save-every` batches for `spacy pretrain` (#3510) When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches. ## Description To test... Save this file to `sample_sents.jsonl` ``` {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} {"text": "hello there."} ``` Then run `--save-every 2` when pretraining. ```bash spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2 ``` And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end of the training, you should see these files (`ls here/`): ```bash config.json model2.bin model5.bin model8.bin log.jsonl model2.temp.bin model5.temp.bin model8.temp.bin model0.bin model3.bin model6.bin model9.bin model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin model1.bin model4.bin model7.bin model1.temp.bin model4.temp.bin model7.temp.bin ``` ### Types of change 
This is a new feature to `spacy pretrain`. 🌵 Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error). ``` Processing matcher.pyx [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx' Traceback (most recent call last): File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module> run(args.root) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run process(base, filename, db) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd func(args) File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx raise Exception("Cython failed") Exception: Cython failed Traceback (most recent call last): File "setup.py", line 276, in <module> setup_package() File "setup.py", line 209, in setup_package generate_cython(root, "spacy") File "setup.py", line 132, in generate_cython raise RuntimeError("Running cythonize failed") RuntimeError: Running cythonize failed ``` Edit: Fixed! after deleting all `.cpp` files: `find spacy -name ".cpp" | xargs rm` ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-04-22 14:10:16 +02:00
Ines Montani	0dce4585b1	Add course to 101	2019-04-19 15:59:51 +02:00
Ines Montani	2efc87c382	Remove unused image	2019-04-19 15:48:12 +02:00
Ines Montani	38395d9518	Merge branch 'spacy.io'	2019-04-19 15:26:20 +02:00
Ines Montani	7ac5bb0a7b	Update landing and feature overview	2019-04-19 15:23:08 +02:00
fizban99	f2f2df6e78	entity types for colors should be in uppercase (#3599 ) although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.	2019-04-17 11:22:56 +02:00
Ines Montani	5289dd1356	Fix formatting	2019-04-13 17:58:26 +02:00
Ines Montani	9e7deeaf48	Remove Datacamp	2019-04-13 17:46:32 +02:00
Santiago Castro	86e4b68aa9	Fix website docs for Vectors.from_glove (#3565 ) * Fix website docs for Vectors.from_glove * Add myself as a contributor	2019-04-10 15:23:27 +02:00
Bharat Raghunathan	72820896d4	Fix typo in web docs cli.md (#3559 )	2019-04-09 11:40:03 +02:00
pierremonico	0d26bfe677	Removes duplicate in table (#3550 ) * Removes duplicate in table Just fixing typos. * Remove newline Co-authored-by: Ines Montani <ines@ines.io>	2019-04-08 10:30:42 +02:00
Ines Montani	2f0f439c54	Remove non-existent example (closes #3533 )	2019-04-03 09:59:17 +02:00
Samuel Kane	06a1846379	fix(util): fix decaying function output (#3495 ) * fix(util): fix decaying function output * fix(util): better test and adhere to code standards * fix(util): correct variable name, pytestify test, update website text	2019-03-28 13:24:47 +01:00
Bharat Raghunathan	1db3e47509	DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492 )	2019-03-28 12:48:02 +01:00
Ines Montani	200d8bdb3c	Merge branch 'spacy.io' [ci skip]	2019-03-23 16:46:34 +01:00
Ines Montani	1e5b917d75	Fix formatting [ci skip]	2019-03-23 16:45:50 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472 ) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471 ) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Ines Montani	b532386a60	Fix typo [ci skip]	2019-03-22 18:36:17 +01:00
Ines Montani	5073ce63fd	Merge branch 'spacy.io' [ci skip]	2019-03-22 15:17:11 +01:00
Ines Montani	0712efc6b3	Update version requirements [ci skip]	2019-03-21 10:23:54 +01:00
Ines Montani	dac8f8ff99	Update Span.__init__ docs (see #3445 ) [ci skip]	2019-03-20 17:24:17 +01:00


1543 Commits