spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 15:41:59 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	113e8d082b	only evaluate named entities for NEL if there is a corresponding gold span (#7074 )	2021-02-22 11:06:50 +11:00
Adriane Boyd	264862c67a	Fix Ukrainian lemmatizer init (#7127 ) Fix class variable and init for `UkrainianLemmatizer` so that it loads the `uk` dictionaries rather than having the parent `RussianLemmatizer` override with the `ru` settings.	2021-02-22 11:05:08 +11:00
Sofie Van Landeghem	ba5a50f62b	NEL docs & UX (#7129 ) * EL set_kb docs fix * custom warning for set_kb mistake	2021-02-22 11:04:22 +11:00
Boian Tzonev	cca8651fc8	Bulgarian tokenizer exceptions (#7114) * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian	2021-02-19 19:19:19 +01:00
Sofie Van Landeghem	709c9e75af	span.ent only returns first sentence (#7084 ) * return first sentence when span contains sentence boundary * docs fix * small fixes * cleanup	2021-02-19 23:02:38 +11:00
Adriane Boyd	30e1a89aeb	Fix displacy output in evaluate CLI (#7122 ) Now that `nlp.evaluate()` does not modify the examples, rerun the pipeline on the (limited) texts in order to provide the predicted annotation in the displacy output option.	2021-02-19 23:01:20 +11:00
Adriane Boyd	4188beda87	Fix conll converter option (#7071 ) Map `conll` to the NER converter, not the `CoNLL-U` converter.	2021-02-18 10:22:41 +01:00
Adriane Boyd	a3293efc48	Add time and level to default logging formatter	2021-02-15 14:19:20 +01:00
Ines Montani	1e3a326e53	Change Dutch transformer recommendation [ci skip] https://github.com/explosion/spaCy/discussions/6529#discussioncomment-366620	2021-02-14 15:30:16 +11:00
Ines Montani	f4f46b617f	Preserve sourced components in fill-config (fixes #7055 ) (#7058 )	2021-02-14 14:02:14 +11:00
Matthew Honnibal	0fb8d437c0	Fix sentence fragments bug (#7056 , #7035 ) (#7057 ) * Add test for #7035 * Update test for issue 7056 * Fix test * Fix transitions method used in testing * Fix state eol detection when rebuffer * Clean up redundant fix	2021-02-14 13:38:13 +11:00
Ines Montani	660642902a	Increment version [ci skip]	2021-02-14 13:36:13 +11:00
Matthew Honnibal	b31471b5b8	Set version to v3.0.2	2021-02-13 23:50:00 +11:00
Ines Montani	9ba715ed16	Tidy up and auto-format	2021-02-13 12:55:56 +11:00
Ines Montani	34ee0fbd70	Merge pull request #7011 from Shumie82/master	2021-02-13 12:30:42 +11:00
Ines Montani	e583050547	Merge pull request #7039 from svlandeg/debug	2021-02-13 11:53:41 +11:00
Ines Montani	6c450decfc	Fix punctuation settings and add to initialize tests	2021-02-13 11:51:21 +11:00
Ines Montani	f4712a634e	Merge pull request #7046 from adrianeboyd/bugfix/vocab-pickle-noun-chunks-6891 Include noun chunks method when pickling Vocab	2021-02-13 11:43:03 +11:00
Adriane Boyd	0ee2ae86bf	Update trf quickstart recommendations Add/update trf recommendations for Bengali, Hindi, Sinhala, and Tamil based on #7044.	2021-02-12 15:55:17 +01:00
svlandeg	03b4ec7d7f	fix typo	2021-02-12 14:30:16 +01:00
Adriane Boyd	5e47a54d29	Include noun chunks method when pickling Vocab	2021-02-12 13:27:46 +01:00
svlandeg	aa3ad8825d	loop instead of any	2021-02-12 13:14:30 +01:00
svlandeg	278e9eaa14	remove ner	2021-02-11 21:08:04 +01:00
svlandeg	ebeedfc70b	regression test for 7029	2021-02-11 20:56:48 +01:00
svlandeg	a52d466bfc	any instead of all	2021-02-11 20:50:55 +01:00
Shumi	4e514f1ea8	Update stop_words.py I have deleted lines 1 to 5 and the statement print(STOP_WORDS)	2021-02-11 21:30:34 +02:00
Shumi	0d57e84b7b	Update lex_attrs.py I have removed lines 1 to 4	2021-02-11 21:28:23 +02:00
Shumi	37ec67f868	Update examples.py I have removed two lines: # coding: utf8 from __future__ import unicode_literals And updated: >>> from spacy.lang.tn.examples import sentences	2021-02-11 21:25:58 +02:00
Shumi	39eeba6760	Update __init__.py Added infixes = TOKENIZER_INFIXES	2021-02-11 21:20:46 +02:00
Ines Montani	26bf642afd	Fix issue #7019 : Handle None scores in evaluate printer (#7026 )	2021-02-11 16:45:23 +11:00
Ines Montani	6b9026a219	Merge pull request #7000 from explosion/feature/project-yml-overrides Support env vars and CLI overrides for project.yml	2021-02-11 12:31:45 +11:00
Ines Montani	ad9ce3c8f6	Fix issue #6950 : allow pickling Tok2Vec with listeners	2021-02-11 11:37:39 +11:00
Shumi	ed3397727e	Delete tag_map.py Tag map file is deleted. I will add it later because it was failing validations	2021-02-10 20:41:18 +02:00
Shumi	7c8721b1bd	Update tag_map.py Updated tag_map	2021-02-10 20:21:22 +02:00
Shumi	f6be28cfb2	Added files to Setswana Language Add South African Setswana Language	2021-02-10 20:15:13 +02:00
Shumi	24046fef17	South African Setswana language Please accept the addition of the Setswana language	2021-02-10 20:12:33 +02:00
Peter Baumann	61b04a70d5	Run PhraseMatcher on Spans (#6918 ) * Add regression test * Run PhraseMatcher on Spans * Add test for PhraseMatcher on Spans and Docs * Add SCA * Add test with 3 matches in Doc, 1 match in Span * Update docs * Use doc.length for find_matches in tokenizer Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-02-10 23:43:32 +11:00
Ines Montani	21176c69b0	Update and add test	2021-02-10 14:12:00 +11:00
Ines Montani	c08b3f294c	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
Koichi Yasuoka	8ed788660b	Several callable objects do not have __qualname__	2021-02-09 14:43:02 +09:00
Adriane Boyd	6108dabdc8	Rephrase error related to sample data initialization Now that the initialize step is fully implemented, the source of E923 is typically missing or improperly converted/formatted data rather than a bug in spaCy, so rephrase the error and message and remove the prompt to open an issue.	2021-02-08 09:21:36 +01:00
Sofie Van Landeghem	6ed423c16c	reduce memory load when reading all vectors from file (#6945 ) * reduce memory load when reading all vectors from file * one more small typo fix	2021-02-07 08:05:43 +08:00
Sofie Van Landeghem	a323ef90df	ensure the loss value is cast as float (#6928 )	2021-02-07 07:51:56 +08:00
melonwater211	a7977b5143	The test `spacy/tests/vocab_vectors/test_lexeme.py::test_vocab_lexeme_add_flag_auto_id` seems to fail occasionally when the test suite is run in a random order. (#6956 ) ```python def test_vocab_lexeme_add_flag_auto_id(en_vocab): is_len4 = en_vocab.add_flag(lambda string: len(string) == 4) assert en_vocab["1999"].check_flag(is_len4) is True assert en_vocab["1999"].check_flag(IS_DIGIT) is True assert en_vocab["199"].check_flag(is_len4) is False > assert en_vocab["199"].check_flag(IS_DIGIT) is True E assert False is True E + where False = <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840>(3) E + where <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840> = <spacy.lexeme.Lexeme object at 0x7fa155c36840>.check_flag spacy/tests/vocab_vectors/test_lexeme.py:49: AssertionError ``` > `pytest==6.1.1` > > `numpy==1.19.2` > > `Python version: 3.8.3` To reproduce the error, run `pytest --random-order-bucket=global --random-order-seed=170158 -v spacy/tests` If `test_vocab_lexeme_add_flag_auto_id` is run after `test_vocab_lexeme_add_flag_provided_id`, it fails. It seems like `test_vocab_lexeme_add_flag_provided_id` uses the `IS_DIGIT` bit for testing purposes but does not reset the bit. This solution seems to work but, if anyone has a better fix, please let me know and I will integrate it.	2021-02-07 07:51:34 +08:00
René Octavio Queiroz Dias	59271e887a	fix: TransformerListener with TextCatEnsemble (#6951 ) * bug: Regression test Issue #6946 * fix: Fix issue #6946 * chore: Remove regression test	2021-02-06 13:44:51 +01:00
René Octavio Queiroz Dias	999ff03b19	fix: Fix textcat labels to expect a Optional[Iterable[str]] instead of Optional[Dict] (#6911 ) * docs: Add agreement * bug: Regression test Issue #6908 * fix: Changed from Dict to Iterable[str] Fix #6908 * Update test to use make_tempdir * fix: Fix WindowsPath error Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-02-04 23:37:13 +01:00
Adriane Boyd	b903de3fcb	Pass on vocab arg in spacy.blank() (#6924 )	2021-02-04 15:09:01 +01:00
svlandeg	f852af2acf	add capture arg	2021-02-02 19:47:12 +01:00
Matthew Honnibal	b6a198481b	Set version to v3.0.0	2021-02-02 20:26:17 +11:00
Sofie Van Landeghem	f319d2765f	Add capture argument to project_run (#6878 ) * add capture argument to project_run and run_commands * git bump to 3.0.1 * Set version to 3.0.1.dev0 Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-02-02 10:11:15 +08:00
Sofie Van Landeghem	f638306598	remove link_components flag again (#6883 )	2021-02-02 10:08:40 +08:00
Ines Montani	a59f3fcf5d	Make wheel the default format and update docs [ci skip]	2021-02-01 23:18:43 +11:00
Ines Montani	b9573e9e22	Fix pip args	2021-02-01 23:15:00 +11:00
Ines Montani	b46073234a	Fix default clone branch and error handling [ci skip]	2021-02-01 22:29:04 +11:00
Sofie Van Landeghem	acabb284dd	Fix linking resumed components (#6859 ) * link components across enabled, resumed and frozen * revert renaming * revert renaming, the sequel	2021-02-01 22:19:58 +11:00
Adriane Boyd	35a863cd27	Remove nlp.tokenizer from quickstart template Remove `nlp.tokenizer` from quickstart template so that the default language-specific tokenizer settings are filled instead.	2021-02-01 11:20:12 +01:00
svlandeg	91e72c031e	reformatting	2021-01-30 17:29:33 +01:00
svlandeg	a8d84188f0	add stop words Co-authored-by: tewodrosm <tedmaam2006@gmail.com>	2021-01-30 17:26:49 +01:00
Ines Montani	f058cbd751	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-30 21:03:25 +11:00
Ines Montani	14f631f52c	Update parent package and version [ci skip]	2021-01-30 20:12:42 +11:00
Ines Montani	3435b894df	Remove nightly reference from auto docs [ci skip]	2021-01-30 20:12:08 +11:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	b26a3daa9a	Merge pull request #6860 from explosion/feature/package-wheel	2021-01-30 14:17:01 +11:00
Ines Montani	2332c4280b	Update and use unified --build option	2021-01-30 13:11:36 +11:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	817b0db521	Fix escape sequence	2021-01-30 12:39:58 +11:00
Ines Montani	526b416118	Tidy up comments	2021-01-30 12:34:09 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Ines Montani	2609ba4e89	Support building wheel in spacy package	2021-01-30 11:54:02 +11:00
Pamphile ROY	41ee75ac6d	Remove --no-cache-dir when downloading models When `--no-cache-dir` is present, it prevents caching to properly function. If the user still wants to do this, there is the possibility to pass options with `user_pip_args`. But you should not enforce options like these. In my case this is preventing some docker build (using buildkit caching) to have proper caching of models.	2021-01-29 15:37:44 +01:00
Ines Montani	bbf080dfe5	Merge pull request #6645 from bittlingmayer/patch-3	2021-01-30 01:26:28 +11:00
Adriane Boyd	bced6309e5	Add full exceptions with spaces	2021-01-29 14:27:22 +01:00
Ines Montani	7886d59c56	Add check for remove_listener method	2021-01-29 23:47:30 +11:00
Ines Montani	7694f76dd1	Update warning and mention replace_listeners	2021-01-29 23:46:01 +11:00
Ines Montani	94232aea08	Improve E889	2021-01-29 23:39:23 +11:00
Ines Montani	924396c20c	Merge branch 'feature/replace-listeners' of https://github.com/explosion/spaCy into feature/replace-listeners	2021-01-29 21:43:10 +11:00
Ines Montani	2102082478	Make Tok2Vec.remove_listener return bool Whether listener was removed	2021-01-29 21:41:38 +11:00
Ines Montani	e766e8c56d	Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-29 21:41:17 +11:00
Ines Montani	bc089b693c	Update tests	2021-01-29 19:38:09 +11:00
Ines Montani	325f47500d	Move replacement logic to Language.from_config	2021-01-29 19:37:04 +11:00
Ines Montani	0f3e3eedc2	Add Tok2vec.remove_listener	2021-01-29 19:36:38 +11:00
Ines Montani	99842387cb	Remove default value	2021-01-29 18:45:37 +11:00
Ines Montani	44b5542d14	Change method order	2021-01-29 18:42:41 +11:00
Ines Montani	8c15d1daec	Update and validate config first and exit early if paths don't exist	2021-01-29 18:24:47 +11:00
Ines Montani	bbb94b37c6	Update error handling and docstring	2021-01-29 16:27:49 +11:00
Ines Montani	01ecfbcc45	Merge branch 'develop' into feature/replace-listeners	2021-01-29 15:57:32 +11:00
Ines Montani	911dfcccfc	Add option to replace listeners for sourced components	2021-01-29 15:57:04 +11:00
Adriane Boyd	fcce3600ed	Forbid OP matching 2+ tokens in DependencyMatcher (#6824 ) Instead of silently using only the first token in each matched span: * Forbid `OP: ?//+` through `DependencyMatcher` validation As a fail-safe, add warning if a token match that's not exactly one token long is found by a token pattern.	2021-01-29 08:52:01 +08:00
Sofie Van Landeghem	24a697abb8	avoid empty aliases and improve UX and docs (#6840 )	2021-01-29 08:51:40 +08:00
Sofie Van Landeghem	837a4f53c2	Error handling in nlp.pipe (#6817 ) * add error handler for pipe methods * add unit tests * remove pipe method that are the same as their base class * have Language keep track of a default error handler * cleanup * formatting * small refactor * add documentation	2021-01-29 08:51:21 +08:00
Ines Montani	cc18f3f23c	Improve Example error handling for NER data (#6835 ) * Improve Example error handling for NER data * Fix conditional	2021-01-28 13:11:20 +11:00
Ines Montani	78d6ff4dd4	Update quickstart recommendations	2021-01-28 11:14:49 +11:00
Ines Montani	ec5f55aa5b	Update config generation defaults and transformers (#6832 )	2021-01-27 23:56:33 +11:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820 ) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
Sofie Van Landeghem	6b68ad027b	Fix beam NER resizing (#6834 ) * move label check to sub methods * add tests	2021-01-27 23:39:14 +11:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833 ) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	615dba9d99	Fix tokenizer exceptions	2021-01-27 22:11:42 +11:00
Ines Montani	abb24fdc0f	Merge pull request #6827 from explosion/feature/add-labels-implicitly	2021-01-27 21:34:58 +11:00
Ines Montani	80ba9eaf7d	Fix test	2021-01-27 21:29:02 +11:00
Ines Montani	e3f8be9a94	Update language data	2021-01-27 13:29:22 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Matthew Honnibal	05050210f3	Dont add labels implicitly for parser	2021-01-27 13:04:47 +11:00
Matthew Honnibal	1d20e21f3e	Add labels implicitly for parser and ner	2021-01-27 12:54:47 +11:00
Matthew Honnibal	68b1c2984d	Test labels are added implicitly	2021-01-27 12:52:29 +11:00
Ines Montani	fabd3a3394	Tidy up code comments [ci skip]	2021-01-27 12:40:03 +11:00
Dhruv Naik	e7db07a0b9	Fix Span.char_span bug (#6816 ) * Create dhruvrnaik.md * add test for issue #6815 * bugfix for issue #6815 * update dhruvrnaik.md * add span.vector test for #6815	2021-01-26 15:50:37 +08:00
Matthew Honnibal	e8674c5c42	Set version to v3.0.0rc5	2021-01-26 14:55:41 +11:00
Adriane Boyd	71a6350744	Implement overwrite param for all custom lemmatizers (#6794 )	2021-01-26 14:53:43 +11:00
Adriane Boyd	2263bc7b28	Update develop from master for v3.0.0rc5 (#6811 ) * Fix `spacy.util.minibatch` when the size iterator is finished (#6745) * Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches. * support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead * ensure span.text works for an empty span (#6772) * Remove unicode_literals Co-authored-by: Santiago Castro <bryant@montevideo.com.uy> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-26 14:52:45 +11:00
Ines Montani	c0926c9088	WIP: Various small training changes (#6818) * Allow output_path to be None during training * Fix cat scoring (?) * Improve error message for weighted None score * Improve messages So we can call this in other places etc. * Fix output path check * Use latest wasabi * Revert "Improve error message for weighted None score" This reverts commit `7059926763`. * Exclude None scores from final score by default It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc. * Update warnings and use logger.warning	2021-01-26 14:51:52 +11:00
Matthew Honnibal	f049df1715	Revert "Set annotations in update" (#6810 ) * Revert "Set annotations in update (#6767)" This reverts commit `e680efc7cc`. * Fix version * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/multilabel_textcategorizer.md * Update website/docs/api/tok2vec.md * Update website/docs/usage/layers-architectures.md * Update website/docs/usage/layers-architectures.md * Update website/docs/api/transformer.md * Update website/docs/api/textcategorizer.md * Update website/docs/api/tagger.md * Update spacy/pipeline/entity_linker.py * Update website/docs/api/sentencerecognizer.md * Update website/docs/api/pipe.md * Update website/docs/api/morphologizer.md * Update website/docs/api/entityrecognizer.md * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/multitask.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/entitylinker.md * Update website/docs/api/dependencyparser.md * Update spacy/pipeline/trainable_pipe.pyx	2021-01-25 22:18:45 +08:00
Matthew Honnibal	42b117e561	Fix Doc.copy bugs (#6809) * Don't let the Doc own LexemeC, to fix Doc.copy * Copy doc.spans	2021-01-25 21:40:18 +08:00
Adriane Boyd	0f2de39efb	Fix types for exclude args in info CLI (#6808 )	2021-01-25 20:00:22 +08:00
muratjumashev	2b19ebad59	Remove Kyrgyz chars from char_classes since the Tatar ones already cover them	2021-01-25 00:46:45 +06:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
muratjumashev	53abf759ad	Fix punctuation	2021-01-24 20:54:22 +06:00
Matthew Honnibal	ffc371350a	Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800 )	2021-01-24 14:37:33 +11:00
muratjumashev	2a2646362b	Fix language subclass	2021-01-23 22:00:50 +06:00
muratjumashev	fe3b5b8ff5	Add kyrgyz to char_classes	2021-01-23 21:53:41 +06:00
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
KeshavG-lb	0a86d833d7	spaCy CLI info method causing backward compatibility issues (#6793) * spaCy CLI info method causing backward compatibility issues #6791: fix backward compatibility by setting a default value for exclude in the info method. * Setting an empty list as a default argument is dangerous, so set the default to None and then replace it with an empty list if it is None. Reference: https://nikos7am.com/posts/mutable-default-arguments/	2021-01-23 11:21:43 +01:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Luigi Coniglio	e83c818a78	DependencyMatcher improvements (fix #6678 ) (#6744 ) * Adding contributor agreement for user werew * [DependencyMatcher] Comment and clean code * [DependencyMatcher] Use defaultdicts * [DependencyMatcher] Simplify _retrieve_tree method * [DependencyMatcher] Remove prepended underscores * [DependencyMatcher] Address TODO and move grouping of token's positions out of the loop * [DependencyMatcher] Remove _nodes attribute * [DependencyMatcher] Use enumerate in _retrieve_tree method * [DependencyMatcher] Clean unused vars and use camel_case naming * [DependencyMatcher] Memoize node+operator map * Add root property to Token * [DependencyMatcher] Groups matches by root * [DependencyMatcher] Remove unused _keys_to_token attribute * [DependencyMatcher] Use a list to map tokens to matcher's keys * [DependencyMatcher] Remove recursion * [DependencyMatcher] Use a generator to retrieve matches * [DependencyMatcher] Remove unused memory pool * [DependencyMatcher] Hide private methods and attributes * [DependencyMatcher] Improvements to the matches validation * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * [DependencyMatcher] Fix keys_to_position_maps * Remove Token.root property * [DependencyMatcher] Remove functools' lru_cache Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-22 11:20:08 +11:00
Sofie Van Landeghem	5ace559201	ensure span.text works for an empty span (#6772 )	2021-01-21 23:18:46 +08:00
Sofie Van Landeghem	d93cd3b7c0	remove artificially duplicated test [ci skip]	2021-01-21 10:53:16 +01:00
Sofie Van Landeghem	fdf8c77630	support IS_SENT_START in PhraseMatcher (#6771 ) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead	2021-01-21 09:59:17 +01:00
Sofie Van Landeghem	e680efc7cc	Set annotations in update (#6767 ) * bump to 3.0.0rc4 * do set_annotations in component update calls * update docs and remove set_annotations flag * fix EL test	2021-01-20 11:49:25 +11:00
Sofie Van Landeghem	57640aa838	warn when frozen components break listener pattern (#6766 ) * warn when frozen components break listener pattern * few notes in the documentation * update arg name * formatting * cleanup * specify listeners return type	2021-01-20 11:12:35 +11:00
Matthew Honnibal	88acbfc050	Copy the Example objects (and their predicted Doc) in nlp.evaluate() and nlp.update() (#6765 ) * Make copy of examples in nlp.update and nlp.evaluate * Avoid circular import * Fix evaluate	2021-01-19 16:47:44 +01:00
Sofie Van Landeghem	bfc212e68f	fix duplicate from merge [ci skip]	2021-01-19 12:14:35 +01:00
Adriane Boyd	bc7d83d4be	Skip 0-length matches (#6759 ) Add hack to prevent matcher from returning 0-length matches.	2021-01-19 07:38:11 +08:00
Sofie Van Landeghem	c8761b0e6e	rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760 )	2021-01-19 07:37:17 +08:00
Adriane Boyd	26c34ab8b0	Fix parser resizing for cupy (#6758 )	2021-01-18 20:43:15 +01:00
Matthew Honnibal	c2a18e4fa3	Update textcat ensemble model	2021-01-19 02:53:02 +11:00
Ines Montani	e697609fef	Update docstrings and types [ci skip]	2021-01-18 22:31:26 +11:00
Ines Montani	f4d547b73c	Fix error code	2021-01-18 11:43:45 +11:00
Ines Montani	1090d3d675	Merge branch 'develop' into feature/spacy-legacy	2021-01-18 11:43:39 +11:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	bf0cdae8d4	Add token_splitter component (#6726) * Add long_token_splitter component Add a `long_token_splitter` component for use with transformer pipelines. This component splits up long tokens like URLs into smaller tokens. This is particularly relevant for pretrained pipelines with `strided_spans`, since the user can't change the length of the span `window` and may not wish to preprocess the input texts. The `long_token_splitter` splits tokens that are at least `long_token_length` characters long into smaller tokens of `split_length` size. Notes: * Since this is intended for use as the first component in a pipeline, the token splitter does not try to preserve any token annotation. * API docs to come when the API is stable. * Adjust API, add test * Fix name in factory	2021-01-17 19:54:41 +08:00
Santiago Castro	28256522c8	Fix `spacy.util.minibatch` when the size iterator is finished (#6745 )	2021-01-17 19:48:43 +08:00
Adriane Boyd	185fc62f4d	Remove unused is_base_form for mk lemmatizer (#6743 ) Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.	2021-01-17 09:41:35 +01:00
Adriane Boyd	43a752a2a0	Fix assertion in default get oracle sequence usage (#6738 ) Remove assertion for default debug value in `get_oracle_sequence_from_state`.	2021-01-16 16:07:39 +01:00
Ines Montani	a552db2819	Include available registry names in error	2021-01-16 14:35:03 +11:00
Matthew Honnibal	f0c696b4aa	Fix failed merge of #6694 patch	2021-01-16 13:44:11 +11:00
Ines Montani	d12be459f6	Raise RegistryError	2021-01-16 12:57:13 +11:00
Adriane Boyd	c8b4370865	Add all strings from source models (#6736 ) Add all strings from the source model when adding a pipe from a source model. Minor: * Skip `disable=["vocab", "tokenizer"]` when loading a source model from the config, since this doesn't do anything and is misleading.	2021-01-16 12:26:15 +11:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704 ) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Matthew Honnibal	7b3f0c6f1b	Questionable fix for parser training bug with misaligned sentences (#6694 ) * Questionable fix for parser training bug with misaligned sentences * Fix Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-15 14:18:24 +01:00
Ines Montani	a203e3dbb8	Support spacy-legacy via the registry	2021-01-15 21:42:40 +11:00
Ines Montani	f9e4ac1283	Fix test	2021-01-15 12:51:02 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	e8a97a2bd6	Merge pull request #6720 from adrianeboyd/feature/improved-init-training-config-validation	2021-01-15 11:45:24 +11:00
Ines Montani	57369909c0	Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3	2021-01-15 11:44:28 +11:00
Adriane Boyd	681a6195f7	Validate seed and gpu_allocator manually	2021-01-14 16:57:57 +01:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712 ) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705 ) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696 ) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
Adriane Boyd	54e8e3c208	Update model-related dependencies (#6725 ) * Update pymorphy2 error messages for Russian and Ukrainian * Add pymorphy2 to pex * Update spacy-pkuseg version for pex	2021-01-14 17:29:44 +11:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
Ines Montani	31a92b28ae	Merge pull request #6715 from adrianeboyd/feature/before-after-init-callbacks Add initialize.before_init and after_init callbacks	2021-01-13 12:17:00 +11:00
Ines Montani	97d5a7ba99	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-13 12:03:02 +11:00
Ines Montani	8d6448ccf7	Add config resolver test	2021-01-13 12:02:59 +11:00
svlandeg	232e953b14	pytest.approx with absolute eps	2021-01-12 20:32:57 +01:00
svlandeg	5b598bd1d5	formatting	2021-01-12 17:28:41 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	5fb8b7037a	Expand initialize/training config validation Validate both `[initialize]` and `[training]` in `debug data` and `nlp.initialize()` with separate config validation error blocks that indicate which block of the config is being validated.	2021-01-12 17:17:00 +01:00
Adriane Boyd	a45d89f09a	Add initialize.before_init and after_init callbacks Add `initialize.before_init` and `initialize.after_init` callbacks to the config. The `initialize.before_init` callback is a place to implement one-time tokenizer customizations that are then saved with the model.	2021-01-12 13:07:44 +01:00
Adriane Boyd	ad43cbb042	Sync missing and misaligned values in Tagger loss (#6689 ) Use `None` for both missing and misaligned annotation in `Tagger.get_loss`, reverting to the default missing value in the loss function.	2021-01-10 11:30:37 +11:00
Matthew Honnibal	c04bab6bae	Fix train loop to avoid swallowing tracebacks (#6693 ) * Avoid swallowing tracebacks in train loop * Format * Handle first	2021-01-09 08:25:47 +08:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310 ) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345) * Update stop_words.py Added three additional stopwords: "a" and "o", which mean "the", and "e", which means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621 ) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
svlandeg	1abeca90a6	refer to _parser_internals.nonproj.DELIMITER	2021-01-07 18:58:13 +01:00
Yohei Tamura	411c842a71	convert tuple to list, because the type mismatches (#6625 )	2021-01-07 16:42:12 +11:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246 ) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	8c1a23209f	Getting scores out of beam_parser (#6684 ) * clean up of ner tests * beam_parser tests * implement get_beam_parses and scored_parses for the dep parser * we don't have to add the parse if there are no arcs	2021-01-07 16:28:27 +11:00
Sofie Van Landeghem	3983bc6b1e	Fix Transformer width in TextCatEnsemble (#6431 ) * add convenience method to determine tok2vec width in a model * fix transformer tok2vec dimensions in TextCatEnsemble architecture * init function should not be nested to avoid pickle issues	2021-01-06 12:44:04 +01:00
Sofie Van Landeghem	402dbc5bae	Getting scores out of beam_ner (#6575 ) * small fixes and formatting * bring test_issue4313 up-to-date, currently fails * formatting * add get_beam_parses method back * add scored_ents function * delete tag map	2021-01-06 12:02:32 +01:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636 )	2021-01-06 12:30:30 +08:00
Adriane Boyd	bf9096437e	Set default lemmas in retokenizer (#6667 ) Instead of unsetting lemmas on retokenized tokens, set the default lemmas to: * merge: concatenate any existing lemmas with `SPACY` preserved * split: use the new `ORTH` values if lemmas were previously set, otherwise leave unset	2021-01-06 12:29:44 +08:00
Adriane Boyd	0041dfbc7f	Use special matcher for exceptions with spaces (#6668 ) Use the special cases phrase matcher for exceptions that include space characters so that exceptions including spaces are supported.	2021-01-06 12:05:10 +08:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474 ) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Bruno	1a77607036	spaCy v3 is not saving the best version in training loop (#6629) * Save best only if it is the best and also respect the average config * Create bratao.md * Update loop.py * Remove average check * Keep before_to_disk	2021-01-06 12:51:30 +11:00
Sofie Van Landeghem	29b59086f9	Prevent 0-length mem alloc (#6653 ) * prevent 0-length mem alloc by adding asserts * fix lexeme mem allocation	2021-01-06 12:50:17 +11:00
Ines Montani	6f83abb971	Merge pull request #6647 from svlandeg/feature/init_config_overwrite	2021-01-05 14:59:04 +11:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	224a3590e9	Merge pull request #6654 from svlandeg/chore/tests-cleanup Unskipping tests	2021-01-05 13:53:40 +11:00
Ines Montani	a9e845426f	Use --force for consistency and add docs	2021-01-05 13:49:59 +11:00
Ines Montani	c4993f16d0	Merge pull request #6651 from svlandeg/bugfix/cli_info	2021-01-05 13:44:26 +11:00
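
Commit 0a86d833d7 above replaces an empty-list default with `None` because a mutable default argument is dangerous. The sketch below illustrates that general Python pitfall only. The function names `collect_info_shared` and `collect_info_fresh` and the appended `"labels"` value are hypothetical and are not spaCy's actual `info` implementation.

```python
from typing import List, Optional


def collect_info_shared(exclude: List[str] = []) -> List[str]:
    # Hypothetical example: the default list is created once, when the
    # function is defined, so every call without an argument mutates
    # the same list object.
    exclude.append("labels")
    return exclude


def collect_info_fresh(exclude: Optional[List[str]] = None) -> List[str]:
    # The pattern described in the commit message: default to None and
    # create a new empty list inside the call when nothing was passed.
    if exclude is None:
        exclude = []
    exclude.append("labels")
    return exclude
```

Calling `collect_info_shared()` repeatedly without arguments returns a list that grows by one entry on each call, while `collect_info_fresh()` returns a new one-element list every time.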


8679 Commits