spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-04 08:44:23 +03:00

Author	SHA1	Message	Date
Ines Montani	8cb7f9ccff	Improve assets and DVC handling (#5719) * Improve assets and DVC handling * Remove outdated comment [ci skip]	2020-07-07 20:51:50 +02:00
Sofie Van Landeghem	a39a110c4e	Few more Example unit tests (#5720) * small fixes in Example, UX * add gold tests for aligned_spans and get_aligned_parse * sentencizer unnecessary	2020-07-07 18:46:00 +02:00
Matthw Honnibal	433dc3c9c9	Simplify PrecomputableAffine slightly	2020-07-07 17:22:47 +02:00
Matthw Honnibal	a4164f67ca	Don't normalize gradients	2020-07-07 17:21:58 +02:00
Matthw Honnibal	8177f25b6c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-07 17:21:10 +02:00
Ines Montani	fa00a85828	Merge pull request #5715 from explosion/chore/tidy-regression-tests	2020-07-07 11:22:07 +02:00
Matthw Honnibal	d1fd3438c3	Add dropout to parser hidden layer	2020-07-07 01:38:15 +02:00
Matthw Honnibal	f25761e513	Dont randomize cuts in parser	2020-07-06 17:51:25 +02:00
Matthw Honnibal	709fc5e4ad	Clarify dropout and seed in Tok2Vec	2020-07-06 17:50:21 +02:00
Matthew Honnibal	19d42f42de	Set version to v3.0.0a2	2020-07-06 17:43:12 +02:00
Matthew Honnibal	cc477be952	Improve gold-standard alignment (#5711) * Remove previous alignment * Implement better alignment, using ragged data structure * Use pytokenizations for alignment * Fixes * Fixes * Fix overlapping entities in alignment * Fix align split_sents * Update test * Commit align.py * Try to appease setuptools * Fix flake8 * use realistic entities for testing * Update tests for better alignment * Improve alignment heuristic Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-07-06 17:39:31 +02:00
Mike Izbicki	7a2ca00794	fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701) * speed up Korean nlp 100x by stopping mecab from reloading on each doc * add contributor agreement * rename variables to improve code readability	2020-07-06 17:03:33 +02:00
Ines Montani	b6deef80f8	Fix class to pickling works as expected	2020-07-06 16:43:45 +02:00
Ines Montani	fa261d09e8	Add alternative CLI option	2020-07-06 15:57:38 +02:00
Adriane Boyd	c67fc6aa5b	Make `docs_to_json` backwards-compatible with v2 (#5714) * In `spacy convert -t json` output the JSON docs wrapped in a list * Add back token-level `ner` alongside the doc-level `entities`	2020-07-06 14:15:00 +02:00
Ines Montani	5b7b2a498d	Tidy up and merge regression tests	2020-07-06 14:05:59 +02:00
Ines Montani	412dbb1f38	Remove dead and/or deprecated code (#5710) * Remove dead and/or deprecated code * Remove n_threads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-06 13:06:25 +02:00
Sofie Van Landeghem	fcbf899b08	Feature/example only (#5707) * remove _convert_examples * fix test_gold, raise TypeError if tuples are used instead of Example's * throwing proper errors when the wrong type of objects are passed * fix deprectated format in tests * fix deprectated format in parser tests * fix tests for NEL, morph, senter, tagger, textcat * update regression tests with new Example format * use make_doc * more fixes to nlp.update calls * few more small fixes for rehearse and evaluate * only import ml_datasets if really necessary	2020-07-06 13:02:36 +02:00
graue70	9860b8399e	Fix typo in test function docstring (#5696)	2020-07-05 15:49:06 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Matthw Honnibal	3f6f087113	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-04 23:52:12 +02:00
Matthw Honnibal	5642507823	Fix has_unknown_spaces in Doc.copy	2020-07-04 23:52:02 +02:00
Matthw Honnibal	8870a6ded7	Specify seeds in HashEmbed	2020-07-04 23:51:49 +02:00
Ines Montani	37c3bb35e2	Auto-format	2020-07-04 16:25:34 +02:00
Ines Montani	abd173937f	Auto-format and update URL	2020-07-04 14:23:44 +02:00
Ines Montani	99aff16d60	Make argument shortcut consistent	2020-07-04 14:23:32 +02:00
Matthew Honnibal	2bd1bf81f1	Refactor pretrain and support character-based objective for v3 (#5706) * Start adding character-based stuff * Start adding character-based objective * Start adding character-based stuff * Start adding character-based objective * Remove outdated comment * Update pretraining models * Add/fix character-based multi-task models * Refactor pretrain and support character-based objective * Update pretrain config * Remove unused * Fix flake8 errors * Clean up imports * Format * Format * Update Thinc version * Raise error if vectors objective but no vectors	2020-07-03 17:57:28 +02:00
Ines Montani	84fb3a3fb3	Auto-format and fix tuple	2020-07-03 15:20:10 +02:00
Adriane Boyd	86d13a9fb8	Set version to 2.3.1 (#5705)	2020-07-03 13:38:41 +02:00
Matthew Honnibal	e1b3e8ee11	Set version to v3.0.0a1	2020-07-03 13:21:08 +02:00
Matthew Honnibal	a902b5f217	Record whether Doc objects are built from known spacing (#5697) * Tell convert CLI to store user data for Doc * Remove assert * Add has_unknwon_spaces flag on Doc * Do not tokenize docs with unknown spaces in Corpus * Handle conversion of unknown spaces in Example * Fixes * Fixes * Draft has_known_spaces support in DocBin * Add test for serialize has_unknown_spaces * Fix DocBin serialization when has_unknown_spaces * Use serialization in test	2020-07-03 12:58:16 +02:00
Adriane Boyd	abad56db7d	Add conllu2docs converter (#5704) Add conllu2docs converter adapted from conllu2json converter	2020-07-03 12:54:32 +02:00
Jan Jessewitsch	e4dcac4a4b	Merging multiple docs into one (#5032) * Add static method to Doc to allow merging of multiple docs. * Add error description for the error that occurs if docs with different vocabs (from different languages) are merged in Doc.from_docs(). * Add test for Doc.from_docs() implementation. * Fix using numpy's concatenate in Doc.from_docs. * Replace typing's type annotations in from_docs. * Simply remove type annotations in from_docs. * Add documentation for Doc.from_docs to api. * Simplify from_docs, its test and the api doc for codebase consistency. * Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes. * Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages. * Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test. * Add MORPH to attrs * Update warnings calls * Remove out-dated error from merge * Rename space_delimiter to ensure_whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-03 11:32:42 +02:00
Sofie Van Landeghem	41b65fd0f8	fix to pretrain script (#5699) * fix to pretrain script * remove unnecessary import	2020-07-02 21:48:01 +02:00
Adriane Boyd	a723fa02a1	DocBin: add version number, missing attributes and strings (#5685) * Add version number to DocBin Add a version number to DocBin for future use. * Add POS to all attributes in DocBin * Add morph string to strings in DocBin * Update DocBin API * Add string for ENT_KB_ID in DocBin	2020-07-02 17:41:50 +02:00
Adriane Boyd	a77c4c3465	Add strings and ENT_KB_ID to Doc serialization (#5691) * Add strings for all writeable Token attributes to `Doc.to/from_bytes()`. * Add ENT_KB_ID to default attributes.	2020-07-02 17:11:57 +02:00
Adriane Boyd	971826a96d	Include git commit in package and model meta (#5694) * Include git commit in package and model meta * Rewrite to read file in setup * Fix file handle	2020-07-02 17:10:27 +02:00
Ines Montani	d36632553a	Merge pull request #5688 from explosion/remove-deprecated Remove deprecated methods: Doc.print_tree, Doc.merge, Span.merge	2020-07-02 15:10:30 +02:00
Ines Montani	8a5b9a6d5f	Merge pull request #5693 from svlandeg/bugfix/nel-v3	2020-07-02 14:45:46 +02:00
Ines Montani	ee8a830248	Merge pull request #5687 from svlandeg/bugfix/init-model Fixing init_model	2020-07-02 14:10:28 +02:00
svlandeg	04ed4d60a8	raise error when links are not aligned to tokens	2020-07-02 13:57:35 +02:00
svlandeg	f503817623	fix parsing entity links in new gold format	2020-07-02 13:48:11 +02:00
Ines Montani	60c2695131	Remove deprecated methods	2020-07-01 22:33:39 +02:00
Ines Montani	fe4cfd0632	Start updating website for v3 [ci skip]	2020-07-01 21:26:39 +02:00
svlandeg	a30bc77415	bugfixing prune_vectors and vectors_loc	2020-07-01 21:00:47 +02:00
Matthw Honnibal	94a0cf46fd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 18:45:45 +02:00
Matthw Honnibal	6a0a27e5c2	Fix max_steps	2020-07-01 18:08:14 +02:00
Ines Montani	8d90e44d74	Fix title	2020-07-01 15:38:01 +02:00
Ines Montani	8fb574900a	Update parent package and version	2020-07-01 15:35:23 +02:00
Matthew Honnibal	0ada186dda	Set version to v3.0.0.dev14	2020-07-01 15:31:04 +02:00
Matthw Honnibal	cb51bb637b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 15:17:27 +02:00
Matthw Honnibal	7734cbc34d	Set batch size in begin_training	2020-07-01 15:16:59 +02:00
Matthw Honnibal	1f7709e9a6	Improve max length check in corpus	2020-07-01 15:16:43 +02:00
Matthw Honnibal	2fa56484b2	Fix eval batch size	2020-07-01 15:16:25 +02:00
Matthw Honnibal	c5d12d1a22	Allow batch size to be set for evaluation in spacy train	2020-07-01 15:04:36 +02:00
Matthw Honnibal	f5532757a3	Filter out 0-length examples in Corpus	2020-07-01 15:02:37 +02:00
Ines Montani	bc87ba97e0	Merge pull request #5681 from svlandeg/bugfix/exec-cwd	2020-07-01 14:13:19 +02:00
Matthw Honnibal	52338a07bb	Set version to v3.0.0.dev13	2020-07-01 02:49:17 +02:00
Matthw Honnibal	fa6d473390	Fix parser maxout_pieces=1	2020-07-01 02:48:58 +02:00
Matthw Honnibal	35af5819e0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 01:03:39 +02:00
Matthw Honnibal	0d6edf5397	Clean up debug code in transition_system	2020-07-01 01:03:20 +02:00
Matthw Honnibal	a1b6add4c8	Fix parser gold cutting and gradient normalization	2020-07-01 01:02:58 +02:00
Matthw Honnibal	8c5a88e777	Fix per-epoch shuffling	2020-07-01 01:02:35 +02:00
svlandeg	a7d547c65e	small fix	2020-06-30 21:56:17 +02:00
svlandeg	8eca7e995e	add try-except to git commands to get an informative warning	2020-06-30 21:53:40 +02:00
Ines Montani	b032943c34	Fix funny printing again	2020-06-30 21:33:41 +02:00
Matthw Honnibal	d525552979	Fix efficiency of parser backprop_nonlinearity	2020-06-30 21:22:54 +02:00
Ines Montani	d64644d9d1	Adjust auto-formatting	2020-06-30 20:36:30 +02:00
Ines Montani	6da3500728	Fix command substitution	2020-06-30 20:35:51 +02:00
svlandeg	e7aff9c5fc	bugfix exec usage in dvc.yaml	2020-06-30 18:51:20 +02:00
svlandeg	60f97bc519	add custom warning when run_command fails	2020-06-30 17:28:43 +02:00
svlandeg	39953c7c60	fix print_run_help with new arg order	2020-06-30 17:28:09 +02:00
svlandeg	cd632d8ec2	move folder for exec argument one up	2020-06-30 17:19:36 +02:00
svlandeg	1ae6fa2554	move subcommand one place up as project_dir has default	2020-06-30 16:04:53 +02:00
svlandeg	a46b76f188	use current working dir as default throughout	2020-06-30 15:39:24 +02:00
svlandeg	b228111925	fix funny printing	2020-06-30 14:54:45 +02:00
Ines Montani	8e20505970	Resolve within working_dir context manager	2020-06-30 13:29:45 +02:00
Ines Montani	72175b5c60	Update project command	2020-06-30 13:17:26 +02:00
Ines Montani	c5e31acb06	Make working_dir yield absolute cwd path	2020-06-30 13:17:14 +02:00
Ines Montani	3aca404735	Make run_command take string and list	2020-06-30 13:17:00 +02:00
Ines Montani	7584fdafec	Fix typo	2020-06-30 12:59:13 +02:00
svlandeg	140c4896a0	split_command util function	2020-06-30 12:54:15 +02:00
Matthw Honnibal	57e09747dc	Improve efficiency of get_oracle_sequences	2020-06-30 11:50:48 +02:00
Matthw Honnibal	233945bfe0	Fix init for padding	2020-06-30 11:50:24 +02:00
svlandeg	d23be563eb	remove redundant setting of no_args_is_help	2020-06-30 11:23:35 +02:00
svlandeg	b311ce982f	Merge remote-tracking branch 'upstream/develop' into fix/small-edits # Conflicts: # spacy/cli/project.py	2020-06-30 11:17:31 +02:00
svlandeg	7e4cbda89a	fix project_init for relative path	2020-06-30 11:09:53 +02:00
Matthw Honnibal	85ed5730a2	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-06-30 01:14:16 +02:00
Ines Montani	e8033df81e	Also handle python3 and pip3	2020-06-29 20:30:42 +02:00
Ines Montani	c874dde66c	Show help on "spacy project"	2020-06-29 20:11:34 +02:00
Ines Montani	1d2c646e57	Fix init and remove .dvc/plots	2020-06-29 20:07:21 +02:00
Matthw Honnibal	5bed6fc431	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-06-29 19:55:24 +02:00
svlandeg	1176783310	fix one more shlex.split	2020-06-29 18:37:42 +02:00
svlandeg	ff233d5743	print details on error msg (e.g. PermissionError on specific file)	2020-06-29 18:22:33 +02:00
svlandeg	894b8e7ff6	throw warning (instead of crashing) when temp dir can't be cleaned	2020-06-29 18:16:39 +02:00
svlandeg	efe7eb71f2	create subfolder in working dir	2020-06-29 17:46:08 +02:00
svlandeg	3487214ba1	fix shlex.split for non-posix	2020-06-29 17:45:47 +02:00
Ines Montani	126050f259	Improve asset fetching Get all paths first and run dvc add once so it only shows one progress bar and one combined git command (if repo is git repo)	2020-06-29 16:55:24 +02:00
Ines Montani	7c08713baa	Improve error messages	2020-06-29 16:54:47 +02:00
Ines Montani	24664efa23	Import project_run_all function	2020-06-29 16:54:19 +02:00
svlandeg	f8dddeda27	print help msg when just calling 'project' without args	2020-06-29 16:38:15 +02:00
svlandeg	bf43ebbf61	fix typo's	2020-06-29 16:32:25 +02:00
Matthew Honnibal	67928036f2	Set version to v3.0.0.dev12	2020-06-29 14:45:43 +02:00
Matthew Honnibal	2d715451a2	Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665) This reverts commit `1dd38191ec`.	2020-06-29 14:34:15 +02:00
Sofie Van Landeghem	8d3c0306e1	refactor fixes (#5664) * fixes in ud_train, UX for morphs * update pyproject with new version of thinc * fixes in debug_data script * cleanup of old unused error messages * remove obsolete TempErrors * move error messages to errors.py * add ENT_KB_ID to default DocBin serialization * few fixes to simple_ner * fix tags	2020-06-29 14:33:00 +02:00
Adriane Boyd	1dd38191ec	Convert custom user_data to token extension format for Japanese tokenizer (#5652) * Convert custom user_data to token extension format Convert the user_data values so that they can be loaded as custom token extensions for `inflection`, `reading_form`, `sub_tokens`, and `lemma`. * Reset Underscore state in ja tokenizer tests	2020-06-29 14:20:26 +02:00
Adriane Boyd	167df42cb6	Move lemmatizer is_base_form to language settings (#5663) Move `Lemmatizer.is_base_form` to the language settings so that each language can provide a language-specific method as `LanguageDefaults.is_base_form`. The existing English-specific `Lemmatizer.is_base_form` is moved to `EnglishDefaults`.	2020-06-29 14:16:57 +02:00
Sofie Van Landeghem	fc3cb1fa9e	NER align tests (#5656) * one_to_man works better. misalignment doesn't yet. * fix tests * restore example * xfail alignment tests	2020-06-29 13:59:17 +02:00
Matthew Honnibal	2d9604d39c	Set version to v3.0.0.dev11	2020-06-29 13:56:46 +02:00
Matthw Honnibal	da50473701	Tweak efficiency of arc_eager.set_costs	2020-06-29 12:17:41 +02:00
Ines Montani	bac8a8d766	Merge branch 'feature/project-cli' into develop	2020-06-29 10:49:05 +02:00
Matthew Honnibal	e14bf9decb	Set version to v3.0.0.dev9	2020-06-28 23:58:10 +02:00
Matthew Honnibal	58c8f731bd	Set version to v3.0.0.dev9	2020-06-28 23:53:14 +02:00
Ines Montani	569376e34e	Replace curl with requests	2020-06-28 16:25:53 +02:00
Ines Montani	dbe86b3453	Update project.py	2020-06-28 15:45:19 +02:00
Ines Montani	dbfa292ed3	Output more stats in evaluate	2020-06-28 15:34:28 +02:00
Ines Montani	90b7fa8fed	Run DVC command in project dir	2020-06-28 15:33:53 +02:00
Ines Montani	2f6ee0d018	Tidy up, document and add custom clone logic	2020-06-28 15:08:35 +02:00
Matthew Honnibal	dc7a9be9f8	Merge branch 'feature/project-cli' of https://github.com/explosion/spaCy into feature/project-cli	2020-06-28 14:07:53 +02:00
Matthew Honnibal	e08257d401	Add example of how to do sparse-checkout	2020-06-28 14:07:32 +02:00
Ines Montani	1b331237aa	Update hashing and config update	2020-06-28 13:17:19 +02:00
Ines Montani	f385344286	Update asset logic and add import-url	2020-06-28 13:07:31 +02:00
Ines Montani	d6aa4cb478	Update asset logic	2020-06-28 12:40:11 +02:00
Ines Montani	ed46951842	Update	2020-06-28 12:24:59 +02:00
Ines Montani	d54f33441a	Merge branch 'feature/project-cli' of https://github.com/explosion/spaCy into feature/project-cli	2020-06-27 21:17:00 +02:00
Ines Montani	cd0dd78276	Simplify model loading (now supported via load_model)	2020-06-27 21:16:57 +02:00
Matthew Honnibal	8e3baebdce	Merge branch 'feature/project-cli' of https://github.com/explosion/spaCy into feature/project-cli	2020-06-27 21:16:18 +02:00
Matthew Honnibal	d8c70b415e	Fix Example usage in evaluate	2020-06-27 21:15:25 +02:00
Ines Montani	e33d2b1bea	Add success message	2020-06-27 21:15:13 +02:00
Ines Montani	42eb381ec6	Improve output handling in evaluate	2020-06-27 21:13:11 +02:00
Ines Montani	df22d490b1	Tidy up types	2020-06-27 21:13:06 +02:00
Ines Montani	6678bd80c2	Check if deps exist in non-DVC commands	2020-06-27 20:57:26 +02:00
Ines Montani	fe06697150	Fix package command and add version option	2020-06-27 20:36:08 +02:00
Ines Montani	165c37ccba	Update project.py	2020-06-27 15:03:21 +02:00
Ines Montani	8979dc254f	Update project init	2020-06-27 14:40:28 +02:00
Ines Montani	c96b4a37b6	Update DVC integration	2020-06-27 14:15:41 +02:00
Ines Montani	7a0fe50610	Merge branch 'develop' into feature/project-cli	2020-06-27 13:03:03 +02:00
Ines Montani	8b305253d3	Update with DVC WIP	2020-06-27 13:02:10 +02:00
Matthw Honnibal	4ff9a837fc	Fix _fix_legacy_dict_data in Example	2020-06-26 23:46:18 +02:00
Matthw Honnibal	1d672e0c12	Revert "attempt to fix _guess_spaces" This reverts commit `5b6ed05752`.	2020-06-26 23:42:41 +02:00
Matthew Honnibal	8c29268749	Improve spacy.gold (no GoldParse, no json format!) (#5555 ) * Update errors * Remove beam for now (maybe) Remove beam_utils Update setup.py Remove beam * Remove GoldParse WIP on removing goldparse Get ArcEager compiling after GoldParse excise Update setup.py Get spacy.syntax compiling after removing GoldParse Rename NewExample -> Example and clean up Clean html files Start updating tests Update Morphologizer * fix error numbers * fix merge conflict * informative error when calling to_array with wrong field * fix error catching * fixing language and scoring tests * start testing get_aligned * additional tests for new get_aligned function * Draft create_gold_state for arc_eager oracle * Fix import * Fix import * Remove TokenAnnotation code from nonproj * fixing NER one-to-many alignment * Fix many-to-one IOB codes * fix test for misaligned * attempt to fix cases with weird spaces * fix spaces * test_gold_biluo_different_tokenization works * allow None as BILUO annotation * fixed some tests + WIP roundtrip unit test * add spaces to json output format * minibatch utiltiy can deal with strings, docs or examples * fix augment (needs further testing) * various fixes in scripts - needs to be further tested * fix test_cli * cleanup * correct silly typo * add support for MORPH in to/from_array, fix morphologizer overfitting test * fix tagger * fix entity linker * ensure test keeps working with non-linked entities * pipe() takes docs, not examples * small bug fix * textcat bugfix * throw informative error when running the components with the wrong type of objects * fix parser tests to work with example (most still failing) * fix BiluoPushDown parsing entities * small fixes * bugfix tok2vec * fix renames and simple_ner labels * various small fixes * prevent writing dummy values like deps because that could interfer with sent_start values * fix the fix * implement split_sent with aligned SENT_START attribute * test for split sentences with various 
alignment issues, works * Return ArcEagerGoldParse from ArcEager * Update parser and NER gold stuff * Draft new GoldCorpus class * add links to to_dict * clean up * fix test checking for variants * Fix oracles * Start updating converters * Move converters under spacy.gold * Move things around * Fix naming * Fix name * Update converter to produce DocBin * Update converters * Allow DocBin to take list of Doc objects. * Make spacy convert output docbin * Fix import * Fix docbin * Fix compile in ArcEager * Fix import * Serialize all attrs by default * Update converter * Remove jsonl converter * Add json2docs converter * Draft Corpus class for DocBin * Work on train script * Update Corpus * Update DocBin * Allocate Doc before starting to add words * Make doc.from_array several times faster * Update train.py * Fix Corpus * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests * Skip tests that cause crashes * Skip test causing segfault * Remove GoldCorpus * Update imports * Update after removing GoldCorpus * Fix module name of corpus * Fix mimport * Work on parser oracle * Update arc_eager oracle * Restore ArcEager.get_cost function * Update transition system * Update test_arc_eager_oracle * Remove beam test * Update test * Unskip * Unskip tests * add links to to_dict * clean up * fix test checking for variants * Allow DocBin to take list of Doc objects. 
* Fix compile in ArcEager * Serialize all attrs by default Move converters under spacy.gold Move things around Fix naming Fix name Update converter to produce DocBin Update converters Make spacy convert output docbin Fix import Fix docbin Fix import Update converter Remove jsonl converter Add json2docs converter * Allocate Doc before starting to add words * Make doc.from_array several times faster * Start updating converters * Work on train script * Draft Corpus class for DocBin Update Corpus Fix Corpus * Update DocBin Add missing strings when serializing * Update train.py * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests Skip tests that cause crashes Skip test causing segfault * Remove GoldCorpus Update imports Update after removing GoldCorpus Fix module name of corpus Fix mimport * Work on parser oracle Update arc_eager oracle Restore ArcEager.get_cost function Update transition system * Update tests Remove beam test Update test Unskip Unskip tests * Add get_aligned_parse method in Example Fix Example.get_aligned_parse * Add kwargs to Corpus.dev_dataset to match train_dataset * Update nonproj * Use get_aligned_parse in ArcEager * Add another arc-eager oracle test * Remove Example.doc property Remove Example.doc Remove Example.doc Remove Example.doc Remove Example.doc * Update ArcEager oracle Fix Break oracle * Debugging * Fix Corpus * Fix eg.doc * Format * small fixes * limit arg for Corpus * fix test_roundtrip_docs_to_docbin * fix test_make_orth_variants * fix add_label test * Update tests * avoid writing temp dir in json2docs, fixing 4402 test * Update test * Add missing costs to NER oracle * Update test * Work on Example.get_aligned_ner method * Clean up debugging * Xfail tests * Remove prints * Remove print * Xfail some tests * Replace unseen labels for parser * Update test * Update test * Xfail test * Fix Corpus * fix imports * fix docs_to_json * various small fixes * cleanup * Support 
gold_preproc in Corpus * Support gold_preproc * Pass gold_preproc setting into corpus * Remove debugging * Fix gold_preproc * Fix json2docs converter * Fix convert command * Fix flake8 * Fix import * fix output_dir (converted to Path by typer) * fix var * bugfix: update states after creating golds to avoid out of bounds indexing * Improve efficiency of ArEager oracle * pull merge_sent into iob2docs to avoid Doc creation for each line * fix asserts * bugfix excl Span.end in iob2docs * Support max_length in Corpus * Fix arc_eager oracle * Filter out uannotated sentences in NER * Remove debugging in parser * Simplify NER alignment * Fix conversion of NER data * Fix NER init_gold_batch * Tweak efficiency of precomputable affine * Update onto-json default * Update gold test for NER * Fix parser test * Update test * Add NER data test * Fix convert for single file * Fix test * Hack scorer to avoid evaluating non-nered data * Fix handling of NER data in Example * Output unlabelled spans from O biluo tags in iob_utils * Fix unset variable * Return kept examples from init_gold_batch * Return examples from init_gold_batch * Dont return Example from init_gold_batch * Set spaces on gold doc after conversion * Add test * Fix spaces reading * Improve NER alignment * Improve handling of missing values in NER * Restore the 'cutting' in parser training * Add assertion * Print epochs * Restore random cuts in parser/ner training * Implement Doc.copy * Implement Example.copy * Copy examples at the start of Language.update * Don't unset example docs * Tweak parser model slightly * attempt to fix _guess_spaces * _add_entities_to_doc first, so that links don't get overwritten * fixing get_aligned_ner for one-to-many * fix indexing into x_text * small fix biluo_tags_from_offsets * Add onto-ner config * Simplify NER alignment * Fix NER scoring for partially annotated documents * fix indexing into x_text * fix test_cli failing tests by ignoring spans in doc.ents with empty label * Fix limit 
* Improve NER alignment * Fix count_train * Remove print statement * fix tests, we're not having nothing but None * fix clumsy fingers * Fix tests * Fix doc.ents * Remove empty docs in Corpus and improve limit * Update config Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-06-26 19:34:12 +02:00
PluieElectrique	90c7eb0e2f	Reduce memory usage of Lookup's BloomFilter (#5606) * Reduce memory usage of Lookup's BloomFilter * Remove extra Table update	2020-06-26 14:09:10 +02:00
Adriane Boyd	b7107ac89f	Disregard special tag _SP in check for new tag map (#5641) * Skip special tag _SP in check for new tag map In `Tagger.begin_training()` check for new tags aside from `_SP` in the new tag map initialized from the provided gold tuples when determining whether to reinitialize the morphology with the new tag map. * Simplify _SP check	2020-06-26 09:23:21 +02:00
Ines Montani	5d235fb767	Merge branch 'develop' into feature/project-cli	2020-06-25 12:27:58 +02:00
Ines Montani	01c394eb23	Update to latest Typer and remove hacks	2020-06-25 12:27:19 +02:00
Ines Montani	82a03ee18e	Replace python with sys.executable	2020-06-25 12:26:53 +02:00
Adriane Boyd	6fe6e761de	Skip vocab in component config overrides (#5624)	2020-06-23 23:21:11 +02:00
Adriane Boyd	d94e961f14	Fix polarity of Token.is_oov and Lexeme.is_oov (#5634) Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the lexeme does not have a vector.	2020-06-23 13:29:51 +02:00
Ines Montani	8131a65dee	Update __init__.py	2020-06-22 16:09:09 +02:00
Ines Montani	2ad7a02400	Merge branch 'develop' into feature/project-cli	2020-06-22 15:33:11 +02:00
Ines Montani	83b4aa05c9	Merge pull request #5626 from explosion/feature/typer	2020-06-22 06:29:03 -07:00
Ines Montani	0ee6d7a4d1	Remove project stuff from this branch	2020-06-22 14:54:38 +02:00
Ines Montani	a6b76440b7	Update project CLI	2020-06-22 14:53:31 +02:00
Hiroshi Matsuda	150a39ccca	Japanese model: add user_dict entries and small refactor (#5573) * user_dict fields: adding inflections, reading_forms, sub_tokens deleting: unidic_tags improve code readability around the token alignment procedure * add test cases, replace fugashi with sudachipy in conftest * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer * tag is space -> both surface and tag are spaces * consider len(text)==0	2020-06-22 14:32:25 +02:00
Ines Montani	3f2f5f9cb3	Remove ml_datasets from install dependencies	2020-06-22 12:14:51 +02:00
Rameshh	c34420794a	Add Nepali Language (#5622) * added support for nepali lang * added examples and test files * added spacy contributor agreement	2020-06-22 10:25:46 +02:00
Karen Hambardzumyan	66a4834e56	Some changes for Armenian (#5616) * Fixing numericals * We need a Armenian question sign to make the sentence a question	2020-06-22 08:50:34 +02:00
Ines Montani	dc5d535659	Tidy up info	2020-06-22 01:17:11 +02:00
Ines Montani	189ed56777	Fix and simplify info	2020-06-22 01:07:48 +02:00
Ines Montani	fca3907d4e	Add correct uppercase variants for boolean flags	2020-06-22 00:57:28 +02:00
Ines Montani	79dd824906	Tidy up	2020-06-22 00:45:40 +02:00
Ines Montani	1e5b4d8524	Fix DVC check	2020-06-22 00:30:05 +02:00
Ines Montani	5ba1df5e78	Update project CLI	2020-06-22 00:15:06 +02:00
Ines Montani	ef5f548fb0	Tidy up and auto-format	2020-06-21 22:38:04 +02:00
Ines Montani	f77e0bc028	Merge branch 'develop' into master-tmp	2020-06-21 22:34:15 +02:00
Ines Montani	40bb918a4c	Remove unicode declarations and tidy up	2020-06-21 22:34:10 +02:00
Ines Montani	275bab62df	Refactor CLI	2020-06-21 21:35:01 +02:00
Ines Montani	c12713a8be	Port CLI to Typer and add project stubs	2020-06-21 13:44:00 +02:00
svlandeg	689600e17d	add additional test back in (it works now)	2020-06-20 23:23:57 +02:00
svlandeg	2f6062a8a4	add line that got removed from EntityLinker	2020-06-20 23:14:45 +02:00
svlandeg	12dc8ab208	remove redundant code from master in EntityLinker	2020-06-20 23:07:42 +02:00
svlandeg	6179774278	fix test_build_dependencies by ignoring new libs	2020-06-20 22:49:37 +02:00
svlandeg	256d4c27c8	fix tagger begin_training being called without examples	2020-06-20 22:38:00 +02:00
svlandeg	5cb812e0ab	fix NER warn empty lookups (cf PR #5588)	2020-06-20 22:04:18 +02:00
svlandeg	c9242e9bf4	fix entity linker (cf PR #5548)	2020-06-20 21:47:23 +02:00
svlandeg	dc069e90b3	fix token.morph_ for v.3 (cf PR #5517)	2020-06-20 21:13:11 +02:00
Ines Montani	988d2a4eda	Add --code-path option to train CLI (#5618)	2020-06-20 18:43:12 +02:00
Ines Montani	5424b70e51	Remove v2 test	2020-06-20 16:18:53 +02:00
Ines Montani	63c22969f4	Update test_issue5230.py	2020-06-20 16:17:48 +02:00
Ines Montani	296b5d633b	Remove references to Python 2 / is_python2	2020-06-20 16:11:13 +02:00
Ines Montani	0cdb631e6c	Fix merge errors	2020-06-20 16:02:42 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Ines Montani	f91e9e8c84	Remove F841 [ci skip]	2020-06-20 14:47:17 +02:00
Ines Montani	8283df80e9	Tidy up and auto-format	2020-06-20 14:15:04 +02:00
Marat M. Yavrumyan	8120b641cc	Update lex_attrs.py (#5608)	2020-06-19 20:00:34 +02:00
Ines Montani	e9d3e177f0	Merge branch 'master' into v2.3.x	2020-06-16 16:31:38 +02:00
Matthew Honnibal	7ff447c5a0	Set version to v2.3.0	2020-06-15 18:22:25 +02:00
Adriane Boyd	0d8405aafa	Updates to docstrings (#5589)	2020-06-15 14:58:36 +02:00
Adriane Boyd	e867e9fa8f	Fix and add warnings related to spacy-lookups-data (#5588) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:58:29 +02:00
Arvind Srinivasan	f698007907	Added Tamil Example Sentences (#5583) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-15 14:58:21 +02:00
Adriane Boyd	c94f7d0e75	Updates to docstrings (#5589)	2020-06-15 14:56:51 +02:00
Adriane Boyd	c482f20778	Fix and add warnings related to spacy-lookups-data (#5588) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:56:04 +02:00
Arvind Srinivasan	aa5b40fa64	Added Tamil Example Sentences (#5583) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-13 15:56:26 +02:00
theudas	3f5e2f9d99	Added Parameter to NEL to take n sentences into account (#5548) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 15:15:03 +02:00
adrianeboyd	4724fa4cf4	Expand Japanese requirements warning (#5572) Include explicit install instructions in Japanese requirements warning.	2020-06-12 15:14:55 +02:00
adrianeboyd	44967a3f9c	Update pytest conf for sudachipy with Japanese (#5574)	2020-06-12 15:14:47 +02:00
Matthew Honnibal	a1c5b694be	Small fixes to train defaults	2020-06-12 02:22:13 +02:00
theudas	fa46e0bef2	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 02:03:23 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575 ) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
adrianeboyd	556895177e	Expand Japanese requirements warning (#5572 ) Include explicit install instructions in Japanese requirements warning.	2020-06-11 13:47:37 +02:00
adrianeboyd	fe167fcf7d	Update pytest conf for sudachipy with Japanese (#5574 )	2020-06-11 10:23:50 +02:00
Jones Martins	bab30e4ad2	Add "c'mon" token exception (#5570 ) * Add "c'mon" exception * Fix typo in "C'mon" exception	2020-06-10 21:54:06 +02:00
Jones Martins	28db7dd5d9	Add missing pronoums/determiners (#5569 ) * Add missing pronoums/determiners * Add test for missing pronoums * Add contributor file	2020-06-10 18:47:04 +02:00
adrianeboyd	0a70bd6281	Bump version to 2.3.0.dev1 (#5567 )	2020-06-09 15:47:31 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566 )	2020-06-09 12:00:59 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564 ) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562 ) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560 )	2020-06-08 15:49:34 +02:00
Ines Montani	d93cbeb14f	Add warning for loose version constraints (#5536 ) * Add warning for loose version constraints * Update wording [ci skip] * Tweak error message Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-06-05 12:42:15 +02:00
adrianeboyd	1ac43d78f9	Avoid libc.stdint for UINT64_MAX (#5545 )	2020-06-04 20:02:05 +02:00
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544 ) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. 
The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
svlandeg	776d4f1190	cleanup	2020-06-04 16:07:30 +02:00
svlandeg	6b027d7689	remove duplicate model definition of tok2vec layer	2020-06-04 15:49:23 +02:00
svlandeg	1775f54a26	small little fixes	2020-06-03 22:17:02 +02:00
svlandeg	07886a3de3	rename init_tok2vec to resume	2020-06-03 22:00:25 +02:00
svlandeg	4ed6278663	small fixes to pretrain config, init_tok2vec TODO	2020-06-03 19:32:40 +02:00
svlandeg	ffe0451d09	pretrain from config	2020-06-03 14:45:00 +02:00
Ines Montani	a8875d4a4b	Fix typo	2020-06-03 14:42:39 +02:00
Ines Montani	4e0610d0d4	Update warning codes	2020-06-03 14:37:09 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	b0ee76264b	Remove debugging	2020-06-03 14:20:42 +02:00
Adriane Boyd	1d8168d1fd	Fix problems with lower and whitespace in variants Port relevant changes from #5361: * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants	2020-06-03 14:15:58 +02:00
Adriane Boyd	10d938f221	Update default cfg dir in train CLI	2020-06-03 14:15:50 +02:00
Adriane Boyd	f1f9c8b417	Port train CLI updates Updates from #5362 and fix from #5387: * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-06-03 14:03:43 +02:00
Adriane Boyd	8c758ed1eb	Fix meta path	2020-06-03 12:11:57 +02:00
Adriane Boyd	a57bdeecac	Test util.get_model_meta instead of util.load_model	2020-06-03 12:10:12 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e91485dfc4	add discard_oversize parameter, move optimizer to training subsection	2020-06-03 10:04:16 +02:00
svlandeg	03c58b488c	prevent infinite loop, custom warning	2020-06-03 10:00:21 +02:00
svlandeg	6504b7f161	Merge remote-tracking branch 'upstream/develop' into feature/pretrain-config	2020-06-03 08:30:16 +02:00
svlandeg	c5ac382f0a	fix name clash	2020-06-02 22:24:57 +02:00
svlandeg	2bf5111ecf	additional test with discard_oversize=False	2020-06-02 22:09:37 +02:00
svlandeg	aa6271b16c	extending algorithm to deal better with edge cases	2020-06-02 22:05:08 +02:00
svlandeg	f2e162fc60	it's only oversized if the tolerance level is also exceeded	2020-06-02 19:59:04 +02:00
svlandeg	ef834b4cd7	fix comments	2020-06-02 19:50:44 +02:00
svlandeg	6208d322d3	slightly more challenging unit test	2020-06-02 19:47:30 +02:00
svlandeg	6651fafd5c	using overflow buffer for examples within the tolerance margin	2020-06-02 19:43:39 +02:00
svlandeg	85b0597ed5	add test for minibatch util	2020-06-02 18:26:21 +02:00
svlandeg	5b350a6c99	bugfix of the bugfix	2020-06-02 17:49:33 +02:00
Adriane Boyd	75f08ad62d	Remove unnecessary check	2020-06-02 17:41:25 +02:00
Adriane Boyd	bbc1836581	Add rudimentary version checks on model load	2020-06-02 17:33:48 +02:00
svlandeg	fdfd822936	rewrite minibatch_by_words function	2020-06-02 15:22:54 +02:00
svlandeg	ec52e7f886	add oversize examples before StopIteration returns	2020-06-02 13:21:55 +02:00
svlandeg	e0f9f448f1	remove Tensorizer	2020-06-01 23:38:48 +02:00
Leo	925e938570	Spanish tokenizer exception and examples improvement (#5531 ) * Spanish tokenizer exception additions. Added Spanish question examples * erased slang tokenization examples	2020-06-01 18:18:34 +02:00
Matthew Honnibal	67af3a32b0	Merge pull request #5527 from adrianeboyd/bugfix/tagger-sp-tag-map Preserve _SP when filtering tag map in Tagger	2020-06-01 12:00:21 +02:00
Leo	c21c308ecb	corrected issue #5524 changed <U+009C> 'STRING TERMINATOR' for <U+0153> LATIN SMALL LIGATURE OE' (#5526 )	2020-05-31 22:08:12 +02:00
Adriane Boyd	a005ccd6d7	Preserve _SP when filtering tag map in Tagger To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if present as the reference `SPACE` POS in the tag map in `Tagger.begin_training()`.	2020-05-31 19:57:54 +02:00

7406 Commits