spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	4f82a02b70	Remove 'fix_pretrained_vectors_name' hack	2020-08-25 14:37:45 +02:00
Adriane Boyd	0bab7c8b91	Remove PRON_LEMMA symbol (#5968 )	2020-08-25 14:21:29 +02:00
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969 ) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Ines Montani	dd84577a98	Update CLI utils, project.yml schema and add test	2020-08-25 11:54:53 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956 ) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
Matthew Honnibal	ef43152af4	Update scorer	2020-08-25 02:42:47 +02:00
Matthew Honnibal	8d6e1ce306	Update v3.0.0a11	2020-08-25 00:32:08 +02:00
Matthew Honnibal	8038b87f04	Various small tweaks to project CLI (#5965 ) * Fix up/download of http and local paths * Support git_sparse_checkout for assets * Fix scorer * Handle already-present directories for git assets * Improve convert command * Fix support for existant files in git assets * Support branches in git sparse checkout * Format * Fix git assets * Document git block in assets * Fix test * Fix test * Revert "Fix test" This reverts commit `cf3097260f`. * Revert "Fix test" This reverts commit `964d636e27`. * Dont multiply p/r/f by 100 * Display scores * 100 during training	2020-08-25 00:30:52 +02:00
Adriane Boyd	abd3f2b65a	Rename Polish lemmatizer method (#5960 ) Rename Polish lemmatizer method to `pos_lookup` to distinguish it from pure token-based lookup methods.	2020-08-25 00:22:27 +02:00
Ines Montani	e12b03358b	Support removing extra values in fill-config (#5966 ) * Support removing extra values in fill-config * Fix test	2020-08-24 22:53:47 +02:00
Matthew Honnibal	f232d8db96	Report p/r/f out of 100	2020-08-24 17:17:23 +02:00
Ines Montani	0e7f99da58	Fix handling of optional [pretraining] block (#5954 ) * Fix handling of optional [pretraining] block * Remote pretraining from default config * Fix test * Add schema option for empty pretrain block	2020-08-24 15:56:03 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952 ) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extentions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
Matthew Honnibal	64df37643f	Update lockfile after project pull	2020-08-24 03:27:09 +02:00
Matthew Honnibal	588c28fe45	Fix project pull when deps missing	2020-08-24 01:23:36 +02:00
Matthew Honnibal	001546c19e	Set version to v3.0.0a10	2020-08-23 21:15:38 +02:00
Matthew Honnibal	160a855246	Format	2020-08-23 21:15:12 +02:00
Matthew Honnibal	89f5b8abb3	Fix project push	2020-08-23 21:14:44 +02:00
Matthew Honnibal	3828bc3ed0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-23 18:32:24 +02:00
Matthew Honnibal	e559867605	Allow spacy project to push and pull to/from remote storage (#5949 ) * Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Updarte util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2020-08-23 18:32:09 +02:00
Matthew Honnibal	fe1cf7e124	Allow score_weights to list extra scores	2020-08-23 18:31:30 +02:00
Ines Montani	9bdc9e81f5	Fix error message [ci skip]	2020-08-23 12:14:02 +02:00
Sofie Van Landeghem	56eabcb2f2	Adding num_like test for Czech (#5946 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py * add like_num testing for czech Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com> Co-authored-by: holubvl3 <vilemrousi@gmail.com> Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 17:06:33 +02:00
holubvl3	a341b4ef09	Adding support for Czech language (#5826 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 16:17:53 +02:00
svlandeg	af36d77d01	fix typo in docstring	2020-08-21 15:56:03 +02:00
svlandeg	3060e4ae65	Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs # Conflicts: # website/src/widgets/quickstart-training-generator.js	2020-08-21 15:16:30 +02:00
svlandeg	cc926267f8	small fixes	2020-08-21 15:05:40 +02:00
Ines Montani	aa6a7cd6e7	Update docs and consistency [ci skip]	2020-08-21 13:49:18 +02:00
Ines Montani	3826cfb8fe	Merge pull request #5930 from svlandeg/feature/init-config-fix UX for init config	2020-08-21 12:06:33 +02:00
Ines Montani	79af7dcd6d	Small wording adjustments [ci skip]	2020-08-21 12:06:19 +02:00
Ines Montani	e60442d83a	Adjust label casing in displaCy NER visualizer (resolves #4866 ) - Accept any case for label names in ents and colors option, even if actual predicted label uses different casing - Don't text-transform: uppercase visually, if it's important to users that the label is represented as-is in the UI	2020-08-21 11:51:31 +02:00
Matthew Honnibal	c356e62908	Minor adjustments to quickstart template	2020-08-21 00:10:21 +02:00
Ines Montani	6ad59d59fe	Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip]	2020-08-20 11:20:58 +02:00
Sofie Van Landeghem	071c09ff35	add coding (#5942 )	2020-08-20 11:08:38 +02:00
Ines Montani	ea6640ea72	Merge pull request #5939 from explosion/feature/thinc-v8.0.0a28 Update Thinc and config variables	2020-08-19 21:14:36 +02:00
Ines Montani	3dd390b1a1	Update Thinc and config variables	2020-08-19 19:46:12 +02:00
svlandeg	b96cd9fa5e	fix typo	2020-08-19 18:46:08 +02:00
Ines Montani	e2f2ef3a5a	Update init config and recommendations - As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations - Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date. - Simplify jinja_to_js script and use fewer dependencies	2020-08-19 13:33:15 +02:00
Ines Montani	2285e59765	Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip]	2020-08-19 11:29:02 +02:00
Matthew Honnibal	c0f6e77a41	Set version to v3.0.0a8	2020-08-18 23:29:00 +02:00
svlandeg	a8acedd4ba	example of custom reader and batcher	2020-08-18 19:15:16 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876 ) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Sofie Van Landeghem	688e77562b	Train CLI script fixes (#5931 ) * fix dash replacement in overrides arguments * perform interpolation on training config * make sure only .spacy files are read	2020-08-18 16:06:37 +02:00
Ines Montani	82f0e20318	Update docs and consistency [ci skip]	2020-08-18 14:39:40 +02:00
svlandeg	10e67b400c	output_file required, spacy-transformers prefered instead of required	2020-08-18 13:38:43 +02:00
Ines Montani	1c3bcfb488	Update docs and util consistency	2020-08-18 01:22:59 +02:00
Ines Montani	990c6b4c32	Update docs and CLI [ci skip]	2020-08-17 21:38:20 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Matthew Honnibal	a95a36ce2a	Set version to v3.0.0a7	2020-08-16 15:51:05 +02:00
Ines Montani	6ae83bde0c	Fix CLI consistency [ci skip]	2020-08-16 15:46:29 +02:00
Ines Montani	45f13cbf64	Merge pull request #5916 from explosion/feature/new-thinc-config	2020-08-16 15:24:12 +02:00
Ines Montani	34bda91695	Show warnings if there's nothing to auto-fill	2020-08-16 14:19:43 +02:00
Ines Montani	dd5804d499	Update type hints	2020-08-16 14:19:33 +02:00
Ines Montani	a570c304df	Update quickstart, template and docs	2020-08-15 14:50:29 +02:00
Ines Montani	3272a63430	Merge pull request #5920 from explosion/fix/logging-warning-various	2020-08-15 14:41:15 +02:00
Ines Montani	fdcde9b0bf	Add init fill-config	2020-08-14 16:49:26 +02:00
Matthew Honnibal	9ebf39fb5f	Relax test	2020-08-14 16:31:09 +02:00
Ines Montani	8128e5eb35	Replace lexeme_norm warning with logging	2020-08-14 15:00:52 +02:00
Ines Montani	37814b608d	Remove env_opt and simplfy default Optimizer	2020-08-14 14:59:54 +02:00
Ines Montani	ab1d165bba	Pass optimizer defined in config to resume/begin_training Otherwise, this would create a default optimizer, which isn't what we want?	2020-08-14 14:59:22 +02:00
Ines Montani	e4d0990857	Only receive from listener if listener exists	2020-08-14 14:58:48 +02:00
Ines Montani	cef97e4b63	Fix path check	2020-08-14 14:58:18 +02:00
Ines Montani	db2dbc8e59	Remove unused warning	2020-08-14 14:58:03 +02:00
Ines Montani	67cc39af7f	Update Thinc and include section order	2020-08-14 14:06:22 +02:00
Ines Montani	88b0a96801	Update for new Thinc and adjust config	2020-08-13 17:38:30 +02:00
Adam Bittlingmayer	7b33b2854f	Add Armenian sentence-final verchaket, Greek question mark and Arabic question mark to default punct (#5910 ) * Add Armenian sentence-final verchaket * Add Greek and Arabic question marks, and contributor agreement * Check box	2020-08-12 15:36:14 +02:00
graue70	49e690bde1	Fix typos in comments (#5904 ) * Fix typo in comment * Fix typo * Add spaCy Contributor Agreement	2020-08-12 15:35:25 +02:00
graue70	ba84371ab0	Use init parameter (#5909 )	2020-08-11 23:41:58 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	f79e4c094d	Remove generic type Seems to cause error on Python 3.8 with Cython?	2020-08-10 17:24:30 +02:00
Ines Montani	c099f6eece	Add Token.lex	2020-08-10 16:43:52 +02:00
Ines Montani	933a7cf8d1	Fix Lexeme.from_ptr	2020-08-10 16:43:37 +02:00
Ines Montani	64f2f84098	Update docstrings and docs [ci skip]	2020-08-10 13:45:22 +02:00
Ines Montani	a4b448eec4	Remove unused compiler flag	2020-08-10 13:13:18 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Ines Montani	d5c78c7a34	Update docs and fix consistency	2020-08-09 22:31:52 +02:00
Ines Montani	7c6854d8d4	Fix missing imports	2020-08-09 22:28:29 +02:00
Matthew Honnibal	0fc13b2f14	Set version to v3.0.0a6	2020-08-09 21:53:32 +02:00
Ines Montani	a15c5fb191	Update docstrings and docs	2020-08-09 16:10:48 +02:00
Ines Montani	8d2baa153d	Update tokenizer docs and add test	2020-08-09 15:24:01 +02:00
Matthew Honnibal	134d933d67	Add docstring for entity linker factory	2020-08-09 15:19:28 +02:00
Matthew Honnibal	992ee1c02f	Update tagger docstring	2020-08-09 15:09:31 +02:00
Matthew Honnibal	ebf9a7acbf	Add textcat docstring	2020-08-09 15:07:09 +02:00
Matthew Honnibal	8a13f510d6	Update tests	2020-08-09 15:01:16 +02:00
Matthew Honnibal	bbd8acd4bf	Add docstrings for parser and NER. Simplify some arguments	2020-08-09 14:46:13 +02:00
Matthew Honnibal	39a3d64c01	Add docstrings for Tok2Vec component	2020-08-09 00:48:03 +02:00
Ines Montani	fd20f84927	Merge pull request #5895 from explosion/docs/batchers Draft docstrings for batchers	2020-08-07 20:07:10 +02:00
Matthew Honnibal	f5c4e0b751	Add docstrings for batchers	2020-08-07 18:51:02 +02:00
Ines Montani	fe29ceec9e	Merge branch 'develop' into docs/model-docstrings	2020-08-07 18:42:01 +02:00
Ines Montani	3a193eb8f1	Fix imports, types and default configs	2020-08-07 18:40:54 +02:00
Matthew Honnibal	b1d83fc13e	Fix imports	2020-08-07 16:55:54 +02:00
Matthew Honnibal	473504d837	Format	2020-08-07 16:49:00 +02:00
Matthew Honnibal	234c52a91e	Add tok2vec docstrings	2020-08-07 16:48:48 +02:00
Matthew Honnibal	547bc8a82b	Add docstring notes	2020-08-07 16:17:34 +02:00
Ines Montani	6f3649923c	Merge pull request #5893 from explosion/feature/validate-arg	2020-08-07 15:47:20 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Matthew Honnibal	da6e59519e	Add docstrings for simple_ner	2020-08-07 15:09:49 +02:00
Matthew Honnibal	7ef8a64df9	Add docstring for parser	2020-08-07 14:59:34 +02:00
Ines Montani	fc9a4fe827	Update attribute ruler	2020-08-07 14:43:55 +02:00
Ines Montani	a8404c3517	validation -> validate	2020-08-07 14:43:47 +02:00
Ines Montani	1d01d89b79	Update CLI docs and evaluate command [ci skip]	2020-08-07 14:40:58 +02:00
Ines Montani	ef2c67cca5	Add DocBin to/from_disk methods and update docs (#5892 ) * Add DocBin to/from_disk methods and update docs * Use DocBin.from_disk in Corpus	2020-08-07 14:30:59 +02:00
Ines Montani	4ca08c6d5d	Merge pull request #5891 from adrianeboyd/docs/attribute-ruler-api Add AttributeRuler API docs	2020-08-07 13:55:12 +02:00
Adriane Boyd	b8d0c23857	Add AttributeRuler API docs With additional minor updates to AttributeRuler docstrings.	2020-08-07 12:43:23 +02:00
svlandeg	b17db0e994	Merge remote-tracking branch 'upstream/develop' into feature/el-docs # Conflicts: # website/docs/usage/training.md	2020-08-06 19:48:52 +02:00
Adriane Boyd	06c3a5e048	Add pipe to AttributeRuler (#5889 )	2020-08-06 19:43:09 +02:00
Ines Montani	9b7f198390	Fix format	2020-08-06 19:30:53 +02:00
Ines Montani	3c4389110d	Remove unused imports	2020-08-06 19:30:47 +02:00
Matthew Honnibal	d4525816ef	Be less choosy about reporting textcat scores (#5879 ) * Set textcat scores more consistently * Refactor textcat scores * Fixes to scorer * Add comments * Add threshold * Rename just 'f' to micro_f in textcat scorer * Fix textcat score for two-class * Fix syntax * Fix textcat score * Fix docstring	2020-08-06 16:24:13 +02:00
svlandeg	0b4d1e1bc4	'debug data' instead of 'debug-data'	2020-08-06 15:47:31 +02:00
svlandeg	881e3f8fd0	add docbin explanation and example	2020-08-06 15:29:44 +02:00
Adriane Boyd	5e683a6e46	Fix return values for per feat score (#5885 ) * Fix return values for per feat score Convert `PRFScore` to dict as other per type scores. * Update tests accordingly	2020-08-06 15:14:47 +02:00
Ines Montani	913d21f0a3	Merge pull request #5882 from explosion/feature/raise-from Use "raise ... from" in custom errors for better tracebacks	2020-08-06 00:35:26 +02:00
Ines Montani	06e80d95cd	Sync develop with nightly docs state (#5883 ) Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-08-06 00:28:14 +02:00
Ines Montani	d92954ac1d	Merge pull request #5881 from explosion/feature/better-error-model-shortcuts	2020-08-06 00:13:35 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880 )	2020-08-05 23:35:09 +02:00
Ines Montani	0881455a5d	Update error message	2020-08-05 23:15:05 +02:00
Ines Montani	2a1fa86a0d	Add better error for failed model shortcut loading	2020-08-05 23:10:29 +02:00
Ines Montani	c675746ca2	Update docstrings and types	2020-08-05 20:29:46 +02:00
Ines Montani	823e533dc1	Add config callbacks for modifying nlp object before and after init (#5866 ) * WIP: Concept for modifying nlp object before and after init * Make callbacks return nlp object Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Raise if callbacks don't return correct type * Rename, update types, add after_pipeline_creation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-05 19:47:54 +02:00
Ines Montani	586d695775	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-08-05 16:01:11 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Matthew Honnibal	50c0e49741	Fix train CLI	2020-08-05 15:40:47 +02:00
Matthew Honnibal	b9df4d6116	Fix textcat.begin_training if vectors set	2020-08-05 15:40:36 +02:00
Adriane Boyd	4193402c47	Add warning when Matcher subpattern is discarded (#5873 ) * Add a warning when a subpattern is not processed and discarded * Normalize subpattern attribute/operator keys to upper case like top-level attributes	2020-08-05 14:56:14 +02:00
Adriane Boyd	af125875cf	Update SimpleNER (#5878 ) * Fix `get_loss` to use NER annotation * Add labels as part of cfg * Add simple overfitting test	2020-08-05 14:43:29 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875 ) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857 ) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Sofie Van Landeghem	34873c4911	Example Dict format consistency (#5858 ) * consistently use upper-case IDS in token_annotation format and for get_aligned * remove ID from to_dict (not used in from_dict either) * fix test Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 22:22:26 +02:00
Adriane Boyd	fa79a0db9f	Add AttributeRuler for token attribute exceptions (#5842 ) * Add AttributeRuler for token attribute exceptions Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes. Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns. There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict. * Fix default name * Update name in error message * Extend AttributeRuler functionality * Add option to initialize with a dict of AttributeRuler patterns * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?) * Improve types * Sort spans before processing * Fix index boundaries in Span * Refactor retokenizer to separate attrs methods Add top-level `normalize_token_attrs` and `set_token_attrs` methods. * Update AttributeRuler to use refactored methods Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token. * Update spacy/pipeline/attributeruler.py Co-authored-by: Ines Montani <ines@ines.io> * Make API more similar to EntityRuler * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts * Return list of dicts as property `AttributeRuler.patterns` * Make attrs_unnormed private * Add test loading patterns from assets * Revert "Fix index boundaries in Span" This reverts commit `8f8a5c3386`. * Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 17:02:39 +02:00
Sofie Van Landeghem	492d1ec5de	Prevent alignment when texts don't match (#5867 ) * remove empty gold.pyx * add alignment unit test (to be used in docs) * ensure that Alignment is only used on equal texts * additional test using example.alignment * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 16:29:18 +02:00
Matthew Honnibal	ecb3c4e8f4	Create corpus iterator and batcher from registry during training (#5865 ) * Move batchers into their own module (and registry) * Update CLI * Update Corpus and batcher * Update tests * Update one config * Merge 'evaluation' block back under [training] * Import batchers in gold __init__ * Fix batchers * Update config * Update schema * Update util * Don't assume train and dev are actually paths * Update onto-joint config * Fix missing import * Format * Format * Update spacy/gold/corpus.py Co-authored-by: Ines Montani <ines@ines.io> * Fix name * Update default config * Fix get_length option in batchers * Update test * Add comment * Pass path into Corpus * Update docstring * Update schema and configs * Update config * Fix test * Fix paths * Fix print * Fix create_train_batches * [training.read_train] -> [training.train_corpus] * Update onto-joint config Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 15:09:37 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872 ) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Adriane Boyd	b7e3018d97	Recalculate alignment if tokenization differs (#5868 ) * Recalculate alignment if tokenization differs * Refactor cached alignment data	2020-08-04 14:31:32 +02:00
Adriane Boyd	c62fd878a3	Allow Doc.char_span to snap to token boundaries (#5849 ) * Allow Doc.char_span to snap to token boundaries Add a `mode` option to allow `Doc.char_span` to snap to token boundaries. The `mode` options: * `strict`: character offsets must match token boundaries (default, same as before) * `inside`: all tokens completely within the character span * `outside`: all tokens at least partially covered by the character span Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching. * Remove unused import * Rename mode to alignment_mode Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.	2020-08-04 13:36:32 +02:00
Adriane Boyd	b841248589	Add Span index boundary checks (#5861 ) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else	2020-08-04 13:35:25 +02:00
Adriane Boyd	cd59979ab4	Fix span boundary handling in Spanish noun_chunks (#5860 )	2020-08-03 13:53:15 +02:00
Ines Montani	934447a611	Merge pull request #5855 from svlandeg/fix/cli-debug	2020-08-03 13:09:20 +02:00
Ines Montani	4c055f0aa7	Add init CLI and init config (#5854 ) * Add init CLI and init config draft * Improve config validation * Auto-format * Don't export anything in debug config * Update docs	2020-08-02 15:18:30 +02:00
svlandeg	6f4e46ee93	Merge remote-tracking branch 'upstream/develop' into fix/cli-debug # Conflicts: # pyproject.toml # requirements.txt # setup.cfg	2020-08-01 18:38:59 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	b68c53858c	Remove global	2020-07-31 18:37:58 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
svlandeg	9b719dfb1a	use divider inbetween steps	2020-07-31 18:06:48 +02:00
svlandeg	51ffc4a166	rename pipe_name to component	2020-07-31 17:58:55 +02:00
svlandeg	878327d38e	printing final predictions by default to False	2020-07-31 17:36:32 +02:00
Ines Montani	2d955fbf98	Fix linting [ci skip]	2020-07-31 17:05:28 +02:00
Ines Montani	e9e8fa2466	Update docs and types	2020-07-31 17:02:54 +02:00
svlandeg	cc2f58a1b0	use data_validation context manager	2020-07-31 16:49:42 +02:00
Adriane Boyd	ac14ce7c30	Prefer earlier spans in EntityRuler (#5843 ) Similar to #4414, update the sorting in EntityRuler to prefer the first span in overlapping spans.	2020-07-31 16:09:32 +02:00
svlandeg	5fa3235d06	set DATA_VALIDATION to False for debug_model (upgrade thinc)	2020-07-31 15:21:01 +02:00
svlandeg	08d3c36c20	bugfix in train CLI	2020-07-31 15:03:43 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Adriane Boyd	901801b33b	Fix default arguments in DependencyParser.score	2020-07-31 10:55:44 +02:00
Adriane Boyd	9d79916792	Merge branch 'develop' into feature/scorer-adjustments	2020-07-31 10:48:14 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844 ) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00
svlandeg	0b23594953	pipe_name instead of section in debug_model	2020-07-30 20:06:28 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830 )	2020-07-29 20:22:47 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Gustavo Zadrozny Leyendecker	90b958fd01	Fix on EntityRendered to support break lines (after last entity) (closes #5838 )	2020-07-29 18:48:39 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Matthew Honnibal	a2d573c039	Merge branch 'feature/vectors' of https://github.com/explosion/spaCy into feature/vectors	2020-07-29 14:56:27 +02:00
Matthew Honnibal	2af741d7e3	Fix train arg	2020-07-29 14:56:01 +02:00
Matthew Honnibal	c27309f839	Merge branch 'develop' into feature/vectors	2020-07-29 14:54:10 +02:00
Ines Montani	62266fb828	Fix broken type annotation	2020-07-29 14:49:49 +02:00
Matthew Honnibal	142b58be92	Fix import	2020-07-29 14:45:09 +02:00
Matthew Honnibal	c99a653070	Adjust textcat model	2020-07-29 14:38:15 +02:00
Matthew Honnibal	9e1b11dd81	Update vectors in textcat	2020-07-29 14:35:36 +02:00
Matthew Honnibal	105cf29967	Fix DocBin	2020-07-29 14:23:13 +02:00
Ines Montani	ff0bc05da8	Fix docstrings [ci skip]	2020-07-29 14:09:37 +02:00
Ines Montani	6e2623d3f8	Fix docstring [ci skip]	2020-07-29 14:08:05 +02:00
Ines Montani	8d56260d92	Fix docstrings [ci skip]	2020-07-29 14:07:13 +02:00
Ines Montani	80b18124d2	Fix docstring [ci skip]	2020-07-29 14:03:35 +02:00
Matthew Honnibal	f0cf4a2dca	Update tests	2020-07-29 14:01:14 +02:00
Matthew Honnibal	07b47eaac8	Update tok2vec layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	5ae8628571	Fix CharacterEmbed layer	2020-07-29 14:01:13 +02:00
Matthew Honnibal	97d3651574	Fix stray link_vectors_to_models call	2020-07-29 14:01:13 +02:00
Matthew Honnibal	c7d1ece3eb	Update tests	2020-07-29 14:01:13 +02:00
Matthew Honnibal	00de30bcc2	Update CharacterEmbed function	2020-07-29 14:01:12 +02:00
Matthew Honnibal	6a6b09bd32	Update morphologizer model	2020-07-29 14:01:12 +02:00
Matthew Honnibal	20e9098e3f	Update tests	2020-07-29 14:01:12 +02:00
Matthew Honnibal	c35d6282fc	Add previous HashEmbedCNN tok2vec to make transition easier	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Matthew Honnibal	0c17ea4c85	Format	2020-07-29 14:00:13 +02:00
Matthew Honnibal	2aff3c4b5a	Load vectors in 'spacy train'	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7852a68a75	Fix load_vectors_into_model function	2020-07-29 14:00:13 +02:00
Matthew Honnibal	7299419fe4	Dont load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	30dd96c540	Load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	df95e2af64	Add load_vectors_into_model util	2020-07-29 14:00:12 +02:00
Matthew Honnibal	475d7c1c7c	Fix StaticVectors class	2020-07-29 14:00:11 +02:00
Matthew Honnibal	44d350dc94	Use spaCy's StaticVectors	2020-07-29 14:00:11 +02:00
Matthew Honnibal	acc64e138a	Add import	2020-07-29 14:00:11 +02:00
Matthew Honnibal	9987ea9e4d	Fix Tok2Vec begin_training	2020-07-29 14:00:10 +02:00
Matthew Honnibal	099e9331c5	Fix tok2vec	2020-07-29 14:00:10 +02:00
Matthew Honnibal	fe0cdcd461	Fixes	2020-07-29 14:00:09 +02:00
Matthew Honnibal	123f8b832d	Refactor Tok2Vec model	2020-07-29 14:00:09 +02:00
Matthew Honnibal	c6b4f63c7c	Remove obsolete function	2020-07-29 14:00:09 +02:00
Matthew Honnibal	9cc7262224	Draft StaticVectors layer	2020-07-29 14:00:09 +02:00
Matthew Honnibal	cb9654e98c	WIP on new StaticVectors	2020-07-29 14:00:09 +02:00

7694 Commits