spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-23 10:31:58 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	dd207a28be	cleanup components API (#5726 ) * add keyword separator for update functions and drop unused "state" * few more Example tests and various small fixes * consistently return losses after update call * eliminate unused tensors field across pipe components * fix name * fix arg name	2020-07-09 19:43:39 +02:00
Adriane Boyd	ac4297ee39	Minor refactor to conversion of output docs (#5718 ) Minor refactor of conversion of docs to output format to avoid duplicate conversion steps.	2020-07-09 19:42:32 +02:00
Sofie Van Landeghem	c1ea55307b	Fixing reproducible training (#5735 ) * Add initial reproducibility tests * failing test for default_text_classifier (WIP) * track trouble to underlying tok2vec layer * add regression test for Issue 5551 * tests go green with https://github.com/explosion/thinc/pull/359 * update test * adding fixed seeds to HashEmbed layers, seems to fix the reproducibility issue Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-09 19:39:31 +02:00
Matthew Honnibal	1827f22f56	Set version to v3.0.0a3	2020-07-09 19:38:04 +02:00
Matthw Honnibal	7010f1a2be	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-09 19:34:11 +02:00
Matthw Honnibal	77af0a6bb4	Offer option of padding-sensitive batching	2020-07-09 14:50:20 +02:00
Matthw Honnibal	3a7f275c02	Add extra batch util	2020-07-09 14:38:41 +02:00
Matthw Honnibal	eb0798c421	Add __len__ method for Example	2020-07-09 14:38:26 +02:00
Ines Montani	8f9552d9e7	Refactor project CLI (#5732 ) * Make project command a submodule * Update with WIP * Add helper for joining commands * Update docstrings, formatting and types * Update assets and add support for copying local files * Fix type * Update success messages	2020-07-09 01:42:51 +02:00
Adriane Boyd	ad15499b3b	Fix get_loss for values outside of labels in senter (#5730 ) * Fix get_loss for None alignments in senter When converting the `sent_start` values back to `SentenceRecognizer` labels, handle `None` alignments. * Handle SENT_START as -1 Handle SENT_START as -1 (or -1 converted to uint64) by treating any values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.	2020-07-09 01:41:58 +02:00
Matthw Honnibal	1b20ffac38	batch_by_words by default	2020-07-08 21:37:06 +02:00
Matthw Honnibal	93e50da46a	Remove auto 'set_annotation' in training to address GPU memory	2020-07-08 21:36:51 +02:00
Matthw Honnibal	fb8a5967c1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-08 15:27:50 +02:00
Ines Montani	0a3d41bb1d	Deprecate model shortcuts and simplify download (#5722 )	2020-07-08 14:00:07 +02:00
Adriane Boyd	c9f0f75778	Update get_loss for senter and morphologizer (#5724 ) * Update get_loss for senter Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`. * Update get_loss for morphologizer Update `Morphologizer.get_loss` to keep it similar to `Tagger`.	2020-07-08 13:59:28 +02:00
Matthw Honnibal	ca989f4cc4	Improve cutting logic in parser	2020-07-08 11:27:54 +02:00
Matthw Honnibal	42e1109def	Support option to not batch by number of words	2020-07-08 11:26:54 +02:00
Ines Montani	8cb7f9ccff	Improve assets and DVC handling (#5719 ) * Improve assets and DVC handling * Remove outdated comment [ci skip]	2020-07-07 20:51:50 +02:00
Sofie Van Landeghem	a39a110c4e	Few more Example unit tests (#5720 ) * small fixes in Example, UX * add gold tests for aligned_spans and get_aligned_parse * sentencizer unnecessary	2020-07-07 18:46:00 +02:00
Matthw Honnibal	433dc3c9c9	Simplify PrecomputableAffine slightly	2020-07-07 17:22:47 +02:00
Matthw Honnibal	a4164f67ca	Don't normalize gradients	2020-07-07 17:21:58 +02:00
Matthw Honnibal	8177f25b6c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-07 17:21:10 +02:00
Ines Montani	fa00a85828	Merge pull request #5715 from explosion/chore/tidy-regression-tests	2020-07-07 11:22:07 +02:00
Matthw Honnibal	d1fd3438c3	Add dropout to parser hidden layer	2020-07-07 01:38:15 +02:00
Matthw Honnibal	f25761e513	Don't randomize cuts in parser	2020-07-06 17:51:25 +02:00
Matthw Honnibal	709fc5e4ad	Clarify dropout and seed in Tok2Vec	2020-07-06 17:50:21 +02:00
Matthew Honnibal	19d42f42de	Set version to v3.0.0a2	2020-07-06 17:43:12 +02:00
Matthew Honnibal	cc477be952	Improve gold-standard alignment (#5711 ) * Remove previous alignment * Implement better alignment, using ragged data structure * Use pytokenizations for alignment * Fixes * Fixes * Fix overlapping entities in alignment * Fix align split_sents * Update test * Commit align.py * Try to appease setuptools * Fix flake8 * use realistic entities for testing * Update tests for better alignment * Improve alignment heuristic Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-07-06 17:39:31 +02:00
Ines Montani	b6deef80f8	Fix class so pickling works as expected	2020-07-06 16:43:45 +02:00
Ines Montani	fa261d09e8	Add alternative CLI option	2020-07-06 15:57:38 +02:00
Adriane Boyd	c67fc6aa5b	Make `docs_to_json` backwards-compatible with v2 (#5714 ) * In `spacy convert -t json` output the JSON docs wrapped in a list * Add back token-level `ner` alongside the doc-level `entities`	2020-07-06 14:15:00 +02:00
Ines Montani	5b7b2a498d	Tidy up and merge regression tests	2020-07-06 14:05:59 +02:00
Ines Montani	412dbb1f38	Remove dead and/or deprecated code (#5710 ) * Remove dead and/or deprecated code * Remove n_threads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-06 13:06:25 +02:00
Sofie Van Landeghem	fcbf899b08	Feature/example only (#5707 ) * remove _convert_examples * fix test_gold, raise TypeError if tuples are used instead of Example's * throwing proper errors when the wrong type of objects are passed * fix deprecated format in tests * fix deprecated format in parser tests * fix tests for NEL, morph, senter, tagger, textcat * update regression tests with new Example format * use make_doc * more fixes to nlp.update calls * few more small fixes for rehearse and evaluate * only import ml_datasets if really necessary	2020-07-06 13:02:36 +02:00
Matthw Honnibal	3f6f087113	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-04 23:52:12 +02:00
Matthw Honnibal	5642507823	Fix has_unknown_spaces in Doc.copy	2020-07-04 23:52:02 +02:00
Matthw Honnibal	8870a6ded7	Specify seeds in HashEmbed	2020-07-04 23:51:49 +02:00
Ines Montani	37c3bb35e2	Auto-format	2020-07-04 16:25:34 +02:00
Ines Montani	abd173937f	Auto-format and update URL	2020-07-04 14:23:44 +02:00
Ines Montani	99aff16d60	Make argument shortcut consistent	2020-07-04 14:23:32 +02:00
Matthew Honnibal	2bd1bf81f1	Refactor pretrain and support character-based objective for v3 (#5706 ) * Start adding character-based stuff * Start adding character-based objective * Start adding character-based stuff * Start adding character-based objective * Remove outdated comment * Update pretraining models * Add/fix character-based multi-task models * Refactor pretrain and support character-based objective * Update pretrain config * Remove unused * Fix flake8 errors * Clean up imports * Format * Format * Update Thinc version * Raise error if vectors objective but no vectors	2020-07-03 17:57:28 +02:00
Ines Montani	84fb3a3fb3	Auto-format and fix tuple	2020-07-03 15:20:10 +02:00
Matthew Honnibal	e1b3e8ee11	Set version to v3.0.0a1	2020-07-03 13:21:08 +02:00
Matthew Honnibal	a902b5f217	Record whether Doc objects are built from known spacing (#5697 ) * Tell convert CLI to store user data for Doc * Remove assert * Add has_unknown_spaces flag on Doc * Do not tokenize docs with unknown spaces in Corpus * Handle conversion of unknown spaces in Example * Fixes * Fixes * Draft has_known_spaces support in DocBin * Add test for serialize has_unknown_spaces * Fix DocBin serialization when has_unknown_spaces * Use serialization in test	2020-07-03 12:58:16 +02:00
Adriane Boyd	abad56db7d	Add conllu2docs converter (#5704 ) Add conllu2docs converter adapted from conllu2json converter	2020-07-03 12:54:32 +02:00
Jan Jessewitsch	e4dcac4a4b	Merging multiple docs into one (#5032 ) * Add static method to Doc to allow merging of multiple docs. * Add error description for the error that occurs if docs with different vocabs (from different languages) are merged in Doc.from_docs(). * Add test for Doc.from_docs() implementation. * Fix using numpy's concatenate in Doc.from_docs. * Replace typing's type annotations in from_docs. * Simply remove type annotations in from_docs. * Add documentation for Doc.from_docs to api. * Simplify from_docs, its test and the api doc for codebase consistency. * Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes. * Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages. * Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test. * Add MORPH to attrs * Update warnings calls * Remove out-dated error from merge * Rename space_delimiter to ensure_whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-03 11:32:42 +02:00
Sofie Van Landeghem	41b65fd0f8	fix to pretrain script (#5699 ) * fix to pretrain script * remove unnecessary import	2020-07-02 21:48:01 +02:00
Adriane Boyd	a723fa02a1	DocBin: add version number, missing attributes and strings (#5685 ) * Add version number to DocBin Add a version number to DocBin for future use. * Add POS to all attributes in DocBin * Add morph string to strings in DocBin * Update DocBin API * Add string for ENT_KB_ID in DocBin	2020-07-02 17:41:50 +02:00
Ines Montani	d36632553a	Merge pull request #5688 from explosion/remove-deprecated Remove deprecated methods: Doc.print_tree, Doc.merge, Span.merge	2020-07-02 15:10:30 +02:00
Ines Montani	8a5b9a6d5f	Merge pull request #5693 from svlandeg/bugfix/nel-v3	2020-07-02 14:45:46 +02:00


7207 Commits