spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-18 09:02:29 +03:00

Author	SHA1	Message	Date
Ines Montani	cc18f3f23c	Improve Example error handling for NER data (#6835 ) * Improve Example error handling for NER data * Fix conditional	2021-01-28 13:11:20 +11:00
Ines Montani	78d6ff4dd4	Update quickstart recommendations	2021-01-28 11:14:49 +11:00
Ines Montani	ec5f55aa5b	Update config generation defaults and transformers (#6832 )	2021-01-27 23:56:33 +11:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820 ) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
Sofie Van Landeghem	6b68ad027b	Fix beam NER resizing (#6834 ) * move label check to sub methods * add tests	2021-01-27 23:39:14 +11:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833 ) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	615dba9d99	Fix tokenizer exceptions	2021-01-27 22:11:42 +11:00
Ines Montani	abb24fdc0f	Merge pull request #6827 from explosion/feature/add-labels-implicitly	2021-01-27 21:34:58 +11:00
Ines Montani	80ba9eaf7d	Fix test	2021-01-27 21:29:02 +11:00
Ines Montani	e3f8be9a94	Update language data	2021-01-27 13:29:22 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Matthew Honnibal	05050210f3	Don't add labels implicitly for parser	2021-01-27 13:04:47 +11:00
Matthew Honnibal	1d20e21f3e	Add labels implicitly for parser and ner	2021-01-27 12:54:47 +11:00
Matthew Honnibal	68b1c2984d	Test labels are added implicitly	2021-01-27 12:52:29 +11:00
Ines Montani	fabd3a3394	Tidy up code comments [ci skip]	2021-01-27 12:40:03 +11:00
Dhruv Naik	e7db07a0b9	Fix Span.char_span bug (#6816 ) * Create dhruvrnaik.md * add test for issue #6815 * bugfix for issue #6815 * update dhruvrnaik.md * add span.vector test for #6815	2021-01-26 15:50:37 +08:00
Matthew Honnibal	e8674c5c42	Set version to v3.0.0rc5	2021-01-26 14:55:41 +11:00
Adriane Boyd	71a6350744	Implement overwrite param for all custom lemmatizers (#6794 )	2021-01-26 14:53:43 +11:00
Adriane Boyd	2263bc7b28	Update develop from master for v3.0.0rc5 (#6811 ) * Fix `spacy.util.minibatch` when the size iterator is finished (#6745) * Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches. * support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead * ensure span.text works for an empty span (#6772) * Remove unicode_literals Co-authored-by: Santiago Castro <bryant@montevideo.com.uy> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-26 14:52:45 +11:00
Ines Montani	c0926c9088	WIP: Various small training changes (#6818) * Allow output_path to be None during training * Fix cat scoring (?) * Improve error message for weighted None score * Improve messages So we can call this in other places etc. * Fix output path check * Use latest wasabi * Revert "Improve error message for weighted None score" This reverts commit `7059926763`. * Exclude None scores from final score by default It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc. * Update warnings and use logger.warning	2021-01-26 14:51:52 +11:00
Matthew Honnibal	f049df1715	Revert "Set annotations in update" (#6810 ) * Revert "Set annotations in update (#6767)" This reverts commit `e680efc7cc`. * Fix version * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/multilabel_textcategorizer.md * Update website/docs/api/tok2vec.md * Update website/docs/usage/layers-architectures.md * Update website/docs/usage/layers-architectures.md * Update website/docs/api/transformer.md * Update website/docs/api/textcategorizer.md * Update website/docs/api/tagger.md * Update spacy/pipeline/entity_linker.py * Update website/docs/api/sentencerecognizer.md * Update website/docs/api/pipe.md * Update website/docs/api/morphologizer.md * Update website/docs/api/entityrecognizer.md * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/multitask.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/entitylinker.md * Update website/docs/api/dependencyparser.md * Update spacy/pipeline/trainable_pipe.pyx	2021-01-25 22:18:45 +08:00
Matthew Honnibal	42b117e561	Fix Doc.copy bugs (#6809) * Don't let the Doc own LexemeC, to fix Doc.copy * Copy doc.spans * Copy doc.spans	2021-01-25 21:40:18 +08:00
Adriane Boyd	0f2de39efb	Fix types for exclude args in info CLI (#6808 )	2021-01-25 20:00:22 +08:00
muratjumashev	2b19ebad59	Remove Kyrgyz chars from char_classes since the Tatar ones already cover them	2021-01-25 00:46:45 +06:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
muratjumashev	53abf759ad	Fix punctuation	2021-01-24 20:54:22 +06:00
Matthew Honnibal	ffc371350a	Avoid assuming encode.get_dim('nO') is set in tok2vec (#6800 )	2021-01-24 14:37:33 +11:00
muratjumashev	2a2646362b	Fix language subclass	2021-01-23 22:00:50 +06:00
muratjumashev	fe3b5b8ff5	Add kyrgyz to char_classes	2021-01-23 21:53:41 +06:00
muratjumashev	e30bbf5432	Add examples	2021-01-23 21:49:08 +06:00
muratjumashev	2f385385a9	Remove comment	2021-01-23 21:36:28 +06:00
muratjumashev	d53724ba1d	Add lex_attrs	2021-01-23 21:35:25 +06:00
muratjumashev	4418ec2eee	Add punctuation	2021-01-23 21:31:31 +06:00
muratjumashev	101d265778	Add stopwords	2021-01-23 21:25:28 +06:00
KeshavG-lb	0a86d833d7	Spacy Cli info method causing backward compatibility issues (#6793) * Spacy Cli info method causing backward compatibility issues #6791 Fix backward compatibility by giving `exclude` a default value in the info method. * Using an empty list as a default argument is dangerous, so set the default to None and replace it with an empty list if it is None. Reference: https://nikos7am.com/posts/mutable-default-arguments/	2021-01-23 11:21:43 +01:00
muratjumashev	28d06ab860	Add tokenizer_exceptions	2021-01-22 23:08:41 +06:00
Luigi Coniglio	e83c818a78	DependencyMatcher improvements (fix #6678 ) (#6744 ) * Adding contributor agreement for user werew * [DependencyMatcher] Comment and clean code * [DependencyMatcher] Use defaultdicts * [DependencyMatcher] Simplify _retrieve_tree method * [DependencyMatcher] Remove prepended underscores * [DependencyMatcher] Address TODO and move grouping of token's positions out of the loop * [DependencyMatcher] Remove _nodes attribute * [DependencyMatcher] Use enumerate in _retrieve_tree method * [DependencyMatcher] Clean unused vars and use camel_case naming * [DependencyMatcher] Memoize node+operator map * Add root property to Token * [DependencyMatcher] Groups matches by root * [DependencyMatcher] Remove unused _keys_to_token attribute * [DependencyMatcher] Use a list to map tokens to matcher's keys * [DependencyMatcher] Remove recursion * [DependencyMatcher] Use a generator to retrieve matches * [DependencyMatcher] Remove unused memory pool * [DependencyMatcher] Hide private methods and attributes * [DependencyMatcher] Improvements to the matches validation * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * [DependencyMatcher] Fix keys_to_position_maps * Remove Token.root property * [DependencyMatcher] Remove functools' lru_cache Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-22 11:20:08 +11:00
Sofie Van Landeghem	5ace559201	ensure span.text works for an empty span (#6772 )	2021-01-21 23:18:46 +08:00
Sofie Van Landeghem	d93cd3b7c0	remove artificially duplicated test [ci skip]	2021-01-21 10:53:16 +01:00
Sofie Van Landeghem	fdf8c77630	support IS_SENT_START in PhraseMatcher (#6771 ) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead	2021-01-21 09:59:17 +01:00
Sofie Van Landeghem	e680efc7cc	Set annotations in update (#6767 ) * bump to 3.0.0rc4 * do set_annotations in component update calls * update docs and remove set_annotations flag * fix EL test	2021-01-20 11:49:25 +11:00
Sofie Van Landeghem	57640aa838	warn when frozen components break listener pattern (#6766 ) * warn when frozen components break listener pattern * few notes in the documentation * update arg name * formatting * cleanup * specify listeners return type	2021-01-20 11:12:35 +11:00
Matthew Honnibal	88acbfc050	Copy the Example objects (and their predicted Doc) in nlp.evaluate() and nlp.update() (#6765 ) * Make copy of examples in nlp.update and nlp.evaluate * Avoid circular import * Fix evaluate	2021-01-19 16:47:44 +01:00
Sofie Van Landeghem	bfc212e68f	fix duplicate from merge [ci skip]	2021-01-19 12:14:35 +01:00
Adriane Boyd	bc7d83d4be	Skip 0-length matches (#6759 ) Add hack to prevent matcher from returning 0-length matches.	2021-01-19 07:38:11 +08:00
Sofie Van Landeghem	c8761b0e6e	rewrite Maxout layer as separate layers to avoid shape inference trouble (#6760 )	2021-01-19 07:37:17 +08:00
Adriane Boyd	26c34ab8b0	Fix parser resizing for cupy (#6758 )	2021-01-18 20:43:15 +01:00
Matthew Honnibal	c2a18e4fa3	Update textcat ensemble model	2021-01-19 02:53:02 +11:00
Ines Montani	e697609fef	Update docstrings and types [ci skip]	2021-01-18 22:31:26 +11:00
Ines Montani	f4d547b73c	Fix error code	2021-01-18 11:43:45 +11:00
Ines Montani	1090d3d675	Merge branch 'develop' into feature/spacy-legacy	2021-01-18 11:43:39 +11:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	bf0cdae8d4	Add token_splitter component (#6726 ) * Add long_token_splitter component Add a `long_token_splitter` component for use with transformer pipelines. This component splits up long tokens like URLs into smaller tokens. This is particularly relevant for pretrained pipelines with `strided_spans`, since the user can't change the length of the span `window` and may not wish to preprocess the input texts. The `long_token_splitter` splits tokens that are at least `long_token_length` tokens long into smaller tokens of `split_length` size. Notes: * Since this is intended for use as the first component in a pipeline, the token splitter does not try to preserve any token annotation. * API docs to come when the API is stable. * Adjust API, add test * Fix name in factory	2021-01-17 19:54:41 +08:00
Santiago Castro	28256522c8	Fix `spacy.util.minibatch` when the size iterator is finished (#6745 )	2021-01-17 19:48:43 +08:00
Adriane Boyd	185fc62f4d	Remove unused is_base_form for mk lemmatizer (#6743 ) Remove unimplemented/incorrect is_base_form for Macedonian lemmatizer.	2021-01-17 09:41:35 +01:00
Adriane Boyd	43a752a2a0	Fix assertion in default get oracle sequence usage (#6738 ) Remove assertion for default debug value in `get_oracle_sequence_from_state`.	2021-01-16 16:07:39 +01:00
Ines Montani	a552db2819	Include available registry names in error	2021-01-16 14:35:03 +11:00
Matthew Honnibal	f0c696b4aa	Fix failed merge of #6694 patch	2021-01-16 13:44:11 +11:00
Ines Montani	d12be459f6	Raise RegistryError	2021-01-16 12:57:13 +11:00
Adriane Boyd	c8b4370865	Add all strings from source models (#6736 ) Add all strings from the source model when adding a pipe from a source model. Minor: * Skip `disable=["vocab", "tokenizer"]` when loading a source model from the config, since this doesn't do anything and is misleading.	2021-01-16 12:26:15 +11:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704 ) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Matthew Honnibal	7b3f0c6f1b	Questionable fix for parser training bug with misaligned sentences (#6694 ) * Questionable fix for parser training bug with misaligned sentences * Fix Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-15 14:18:24 +01:00
Ines Montani	a203e3dbb8	Support spacy-legacy via the registry	2021-01-15 21:42:40 +11:00
Ines Montani	f9e4ac1283	Fix test	2021-01-15 12:51:02 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	e8a97a2bd6	Merge pull request #6720 from adrianeboyd/feature/improved-init-training-config-validation	2021-01-15 11:45:24 +11:00
Ines Montani	57369909c0	Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3	2021-01-15 11:44:28 +11:00
Adriane Boyd	681a6195f7	Validate seed and gpu_allocator manually	2021-01-14 16:57:57 +01:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Adriane Boyd	e649242927	Prevent overlapping noun chunks for Spanish (#6712 ) * Prevent overlapping noun chunks in Spanish noun chunk iterator * Clean up similar code in Danish noun chunk iterator	2021-01-14 17:33:31 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705 ) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696 ) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
Adriane Boyd	54e8e3c208	Update model-related dependencies (#6725 ) * Update pymorphy2 error messages for Russian and Ukrainian * Add pymorphy2 to pex * Update spacy-pkuseg version for pex	2021-01-14 17:29:44 +11:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
Ines Montani	31a92b28ae	Merge pull request #6715 from adrianeboyd/feature/before-after-init-callbacks Add initialize.before_init and after_init callbacks	2021-01-13 12:17:00 +11:00
Ines Montani	97d5a7ba99	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-13 12:03:02 +11:00
Ines Montani	8d6448ccf7	Add config resolver test	2021-01-13 12:02:59 +11:00
svlandeg	232e953b14	pytest.approx with absolute eps	2021-01-12 20:32:57 +01:00
svlandeg	5b598bd1d5	formatting	2021-01-12 17:28:41 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	5fb8b7037a	Expand initialize/training config validation Validate both `[initialize]` and `[training]` in `debug data` and `nlp.initialize()` with separate config validation error blocks that indicate which block of the config is being validated.	2021-01-12 17:17:00 +01:00
Adriane Boyd	a45d89f09a	Add initialize.before_init and after_init callbacks Add `initialize.before_init` and `initialize.after_init` callbacks to the config. The `initialize.before_init` callback is a place to implement one-time tokenizer customizations that are then saved with the model.	2021-01-12 13:07:44 +01:00
Adriane Boyd	ad43cbb042	Sync missing and misaligned values in Tagger loss (#6689 ) Use `None` for both missing and misaligned annotation in `Tagger.get_loss`, reverting to the default missing value in the loss function.	2021-01-10 11:30:37 +11:00
Matthew Honnibal	c04bab6bae	Fix train loop to avoid swallowing tracebacks (#6693 ) * Avoid swallowing tracebacks in train loop * Format * Handle first	2021-01-09 08:25:47 +08:00
Alex Combessie	9cc880014c	Remove questionable French stopwords (#6310 ) * Remove questionable French stopwords * Create alexcombessie.md	2021-01-08 11:36:22 +11:00
Cristiana S Parada	7a0222f260	Update stop_words.py in Portuguese (a,o,e) (#6345) * Update stop_words.py Added three additional stopwords: "a" and "o", which mean "the", and "e", which means "and" * Create cristianasp.md * zero edit to push CI Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-08 11:35:38 +11:00
Lorena Ciutacu	f11002f1f1	add new Romanian stopwords (#6621 ) * add contributor agreement * update ro stopwords list * add new stopwords	2021-01-08 11:34:47 +11:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
svlandeg	1abeca90a6	refer to _parser_internals.nonproj.DELIMITER	2021-01-07 18:58:13 +01:00
Yohei Tamura	411c842a71	convert tuple to list, because the type mismatches (#6625 )	2021-01-07 16:42:12 +11:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246 ) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	8c1a23209f	Getting scores out of beam_parser (#6684 ) * clean up of ner tests * beam_parser tests * implement get_beam_parses and scored_parses for the dep parser * we don't have to add the parse if there are no arcs	2021-01-07 16:28:27 +11:00
Sofie Van Landeghem	3983bc6b1e	Fix Transformer width in TextCatEnsemble (#6431 ) * add convenience method to determine tok2vec width in a model * fix transformer tok2vec dimensions in TextCatEnsemble architecture * init function should not be nested to avoid pickle issues	2021-01-06 12:44:04 +01:00
Sofie Van Landeghem	402dbc5bae	Getting scores out of beam_ner (#6575 ) * small fixes and formatting * bring test_issue4313 up-to-date, currently fails * formatting * add get_beam_parses method back * add scored_ents function * delete tag map	2021-01-06 12:02:32 +01:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636 )	2021-01-06 12:30:30 +08:00
Adriane Boyd	bf9096437e	Set default lemmas in retokenizer (#6667 ) Instead of unsetting lemmas on retokenized tokens, set the default lemmas to: * merge: concatenate any existing lemmas with `SPACY` preserved * split: use the new `ORTH` values if lemmas were previously set, otherwise leave unset	2021-01-06 12:29:44 +08:00
Adriane Boyd	0041dfbc7f	Use special matcher for exceptions with spaces (#6668 ) Use the special cases phrase matcher for exceptions that include space characters so that exceptions including spaces are supported.	2021-01-06 12:05:10 +08:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474 ) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Bruno	1a77607036	spaCy v3 is not saving the best version in training loop (#6629 ) * Save best only if is the best and also respect the average config * Create bratao.md * Update loop.py * Remove average check * Keep before_to_disk	2021-01-06 12:51:30 +11:00
Sofie Van Landeghem	29b59086f9	Prevent 0-length mem alloc (#6653 ) * prevent 0-length mem alloc by adding asserts * fix lexeme mem allocation	2021-01-06 12:50:17 +11:00
Ines Montani	6f83abb971	Merge pull request #6647 from svlandeg/feature/init_config_overwrite	2021-01-05 14:59:04 +11:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	224a3590e9	Merge pull request #6654 from svlandeg/chore/tests-cleanup Unskipping tests	2021-01-05 13:53:40 +11:00
Ines Montani	a9e845426f	Use --force for consistency and add docs	2021-01-05 13:49:59 +11:00
Ines Montani	c4993f16d0	Merge pull request #6651 from svlandeg/bugfix/cli_info	2021-01-05 13:44:26 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adriane Boyd	b57be94c78	Fix memory issues in Language.evaluate (#6386 ) * Fix memory issues in Language.evaluate Reset annotation in predicted docs before evaluating and store all data in `examples`. * Minor refactor to docs generator init * Fix generator expression * Fix final generator check * Refactor pipeline loop * Handle examples generator in Language.evaluate * Add test with generator * Use make_doc	2020-12-31 10:45:50 +11:00
svlandeg	a6a68da673	unskipping tests with python >= 3.6	2020-12-30 18:46:43 +01:00
svlandeg	d5ff0fecf8	add docs	2020-12-30 14:01:13 +01:00
svlandeg	c74ab6a313	fix imports	2020-12-30 12:40:12 +01:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
svlandeg	4347e6d39b	fixes for CLI info command	2020-12-30 12:05:58 +01:00
svlandeg	62b4fe118f	prevent overwriting existing config file	2020-12-29 15:40:22 +01:00
Adam Bittlingmayer	f2fe60bacf	Update tokenizer_exceptions.py See https://github.com/explosion/spaCy/pull/6643	2020-12-29 16:05:11 +04:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595 ) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Tim Gates	292c1d6a73	docs: fix simple typo, speficied -> specified (#6611 ) There is a small typo in spacy/cli/info.py. Should read `specified` rather than `speficied`.	2020-12-22 09:14:10 +01:00
Adriane Boyd	cabd4ae5b1	Use logger.warning instead of logger.warn (#6596 ) Use `logger.warning` instead of deprecated `logger.warn`.	2020-12-21 08:25:10 +08:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460 ) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	0a923a7915	Tagger robustness (#6580 ) * require labels in taggers * ensure tagger works with incomplete data	2020-12-18 18:51:47 +08:00
Adriane Boyd	e10295c9fd	Fix memory leak when adding empty morph (#6581 ) Fix lookup of empty morph in the morphology table, which fixes a memory leak where a new morphology tag was allocated each time the empty morph tag was added.	2020-12-18 18:51:01 +08:00
Ines Montani	e9b0963827	Merge pull request #6333 from adrianeboyd/chore/python39	2020-12-17 22:11:57 +11:00
Ines Montani	47c1ec678b	Merge branch 'develop' into pr/6333	2020-12-17 10:19:28 +11:00
Ines Montani	3f90bffa27	Merge pull request #6571 from adrianeboyd/bugfix/debug-data-missing-vectors Fix alignment and vector checks in debug data	2020-12-17 10:10:47 +11:00
Thomas Bird	cbb8c66da3	prevent the root logger from initialising	2020-12-15 19:50:34 +00:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547 ) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Adriane Boyd	20e18cc246	Fix alignment and vector checks in debug data * Update token alignment check to use Example alignment * Update missing vector check further related to changes in v3	2020-12-15 09:43:14 +01:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369 ) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Ines Montani	513c4e332a	Include custom code via spacy package command (#6531 )	2020-12-10 20:36:46 +08:00
Adriane Boyd	7b277661f6	Set version to v2.3.5	2020-12-10 13:32:10 +01:00
Ines Montani	2a6043fabb	Merge pull request #6530 from explosion/feature/init-config-cpu-gpu	2020-12-10 09:38:46 +11:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Adriane Boyd	6ee6e41234	Update docstring for Language.evaluate	2020-12-09 10:21:39 +01:00
Adriane Boyd	fa8fa474a3	Add nlp.batch_size setting Add a default `batch_size` setting for `Language.pipe` and `Language.evaluate` as `nlp.batch_size`.	2020-12-09 09:13:26 +01:00
Ines Montani	f2571b5ec4	Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master	2020-12-09 13:09:58 +11:00
Ines Montani	90171f2031	Merge pull request #6528 from svlandeg/feature/pipe_fill_config	2020-12-09 12:01:22 +11:00
Ines Montani	dfaef27f90	Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos Warn on empty POS for the rule-based lemmatizer	2020-12-09 11:34:16 +11:00
Ines Montani	271923eaea	Fix retokenizer	2020-12-09 11:29:55 +11:00
Ines Montani	b85bd63eca	Fix test	2020-12-09 11:24:01 +11:00
Ines Montani	febf71af28	Fix test	2020-12-09 11:23:07 +11:00
Ines Montani	1da1568110	Remove tag map	2020-12-09 11:13:49 +11:00
Ines Montani	1980203229	Merge branch 'master' into pr/6444	2020-12-09 11:09:40 +11:00
Ines Montani	05a2812ae0	Merge branch 'develop' into pr/6444	2020-12-09 11:04:03 +11:00
Ines Montani	758ad6c3cd	Make CPU the default for init config	2020-12-09 11:00:51 +11:00
Ines Montani	5d605d539d	Remove output_file from init_config helper	2020-12-09 10:57:55 +11:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481 ) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
Sofie Van Landeghem	de108ed3e8	Add specific error when StaticVectors can't read the vectors data (#6450 )	2020-12-09 06:16:07 +08:00
Koichi Yasuoka	0afb54ac93	JapaneseTokenizer.pipe added (#6515 ) * JapaneseTokenizer.pipe added For [spacymoji](https://spacy.io/universe/project/spacymoji) with `Japanese()`. * DummyTokenizer.pipe added instead	2020-12-08 20:02:23 +01:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Ines Montani	8921364579	Merge pull request #6521 from explosion/feature/config-stdin Allow reading config from stdin in spacy train	2020-12-08 22:07:43 +11:00
Ines Montani	6c7a930ee8	Fix variable	2020-12-08 20:44:59 +11:00
Ines Montani	94a5a9814f	Update argument handling and documentation	2020-12-08 20:41:18 +11:00
Adriane Boyd	6c221d4841	Fix subsequent pipe detection in EntityRuler Fix subsequent pipe detection to detect the position of the current object by comparing the component itself rather than from the factory name.	2020-12-08 10:01:30 +01:00
Adriane Boyd	5ceac425ee	Remove non-working --use-chars from train CLI Remove the non-working `--use-chars` option from the train CLI. The implementation of the option across component types and the CLI settings could be fixed, but the `CharacterEmbed` model does not work on GPU in v2 so it's better to remove it.	2020-12-08 08:30:00 +01:00
Ines Montani	d25b1606d6	Allow reading config from stdin in spacy train	2020-12-08 18:01:40 +11:00
Ines Montani	6cfa66ed1c	Make training.loop return nlp object and path (#6520 )	2020-12-08 14:55:55 +08:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336 ) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	e931d3f72b	Move max_length to nlp.make_doc() (#6512 ) Move max_length check to `nlp.make_doc()` so that it's also checked for `nlp.pipe()`.	2020-12-08 14:24:02 +08:00
Ines Montani	ee2ec52f48	Merge pull request #6409 from svlandeg/feature/trf-docs	2020-12-08 06:32:10 +01:00
Ines Montani	82e88f0e3b	Merge pull request #6379 from svlandeg/fix/labels-constructor	2020-12-08 06:29:56 +01:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Adriane Boyd	78085fab1f	Check for spacy-nightly package in download (#6502 ) Also check for spacy-nightly in download so that `--no-deps` isn't set for normal nightly installs.	2020-12-04 09:40:03 +01:00
Ines Montani	63f83e7034	Merge pull request #6470 from adrianeboyd/feature/license-in-package	2020-12-04 03:55:54 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	31ec9a906e	Clean up 3rd party license info (#6478 ) Move scikit-learn license from `Scorer` to `licenses/3rd_party_licenses.txt`.	2020-12-02 10:15:23 +01:00
Adriane Boyd	591cd48aa8	Remove config.cfg from MANIFEST	2020-12-01 12:58:02 +01:00
Adriane Boyd	b0dd13e0ba	Support LICENSE in spacy package If present, include the file `input_dir/LICENSE` at the top level of the packaged model.	2020-11-30 13:43:58 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on an individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Sofie Van Landeghem	079f6ea474	avoid resolving the full config (#6465 )	2020-11-30 09:34:29 +08:00
Ines Montani	9beba7164f	Make jinja2 top-level import No problem anymore since it's now an official dependency	2020-11-27 15:17:14 +08:00
Adriane Boyd	26296ab223	Add error message if DocBin zlib decompress fails (#6394 ) Add a better error message if DocBin zlib decompress fails, indicating that the data is not in `DocBin` format.	2020-11-27 14:39:49 +08:00
Adriane Boyd	3a5cc5f8b4	Set version to v2.3.4	2020-11-26 08:48:52 +01:00
Adriane Boyd	e0f5646a4a	Restore cleanup_beam method (#6446 )	2020-11-25 13:21:48 +01:00
Adriane Boyd	cf693f0eae	Fix token_match in tokenizer	2020-11-25 11:49:34 +01:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Adriane Boyd	573f5c863f	Fix tag map clobbering in spacy train (#6437 ) Fix bug from #5768 where the tag map is clobbered if a custom tag map isn't provided.	2020-11-24 13:13:16 +01:00
Adriane Boyd	ce18fc6588	Set version to v2.3.3	2020-11-24 10:03:45 +01:00
Adriane Boyd	cd61d264ef	Set version to v2.3.3.dev0	2020-11-23 13:51:59 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	3f61f5eb54	Use int8_t instead of char in Matcher (#6413 ) * Use signed char instead of char in Matcher Remove unused char* utf8_t typedef * Use int8_t instead of signed char	2020-11-23 10:26:47 +01:00
Adriane Boyd	4284605683	Remove Beam cleanup (#6414 ) Beam cleanup is handled through the Beam finalization method.	2020-11-23 10:01:46 +01:00
Adriane Boyd	a8c2dad466	Add all vectors to vocab before pruning (#6408 ) Add all vectors to the vocab before pruning to correct the selection of vectors to prioritize.	2020-11-23 10:00:59 +01:00
svlandeg	636be3c791	Merge remote-tracking branch 'upstream/develop' into feature/trf-docs	2020-11-19 14:15:35 +01:00
svlandeg	73fc1ed963	remove labels from morphologizer constructor	2020-11-11 21:48:50 +01:00
svlandeg	d5a920325f	remove labels from constructor	2020-11-11 21:34:12 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Sofie Van Landeghem	a0c899a0ff	Fix textcat + transformer architecture (#6371 ) * add pooling to textcat TransformerListener * maybe_get_dim in case it's null	2020-11-10 20:14:47 +08:00
Ines Montani	de6453940e	Merge pull request #6305 from svlandeg/feature/score-docs [ci skip]	2020-11-10 02:52:11 +01:00
Ines Montani	d7950c5ada	Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip]	2020-11-10 02:45:52 +01:00
svlandeg	789fb3d124	add docs for upstream argument of TransformerListener	2020-11-09 21:42:58 +01:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248 ) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00
Robert Šípek	6069efe57d	Add tag map to cs language (#6284 )	2020-11-05 10:13:11 +01:00
Vu Ha	6d465ec52c	add oprd to the list of accepted deps for noun chunking (#6302 ) * add oprd to the list of accepted deps for noun chunking * add SCA	2020-11-05 09:17:35 +01:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Sofie Van Landeghem	8ef056cf98	fix embed_size in Entity Linker architecture (#6343 )	2020-11-04 22:20:13 +01:00
Adriane Boyd	084fc575aa	Set version to v3.0.0rc3	2020-11-03 17:29:57 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_spans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Sofie Van Landeghem	2918923541	fix resolving of dot notation (#6326 )	2020-10-31 12:17:06 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbrevs to another list * fixed typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
svlandeg	080066ae74	remove TODO note	2020-10-26 10:37:25 +01:00
Ines Montani	2c9804038d	Fix success message [ci skip]	2020-10-23 16:11:54 +02:00
Adriane Boyd	4299a7f654	Setup / install / quickstart updates * Add `cuda110` to setup.cfg and quickstart dropdown * Switch to `pip` for pip-only packages in conda quickstart instructions * Update zh pkuseg install message with version range and conda * Remove `zh` from `extras_require` because the default doesn't require additional packages	2020-10-23 11:27:54 +02:00
Adriane Boyd	563a21834e	Save raw scores in evaluate output	2020-10-19 15:49:09 +02:00
Adriane Boyd	dd207ca6d0	Add dep_las_per_type and more generic PRF printer	2020-10-19 15:49:02 +02:00
Adriane Boyd	4300858ecb	Include per-type/feat scores in evaluate output	2020-10-19 15:48:55 +02:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Ines Montani	5a6ed01ce0	Merge pull request #6262 from adrianeboyd/bugfix/template-en-vectors	2020-10-16 15:38:08 +02:00
Adriane Boyd	c8d04b79e2	Sort and add vectors for langs without transformers	2020-10-16 08:25:16 +02:00
Adriane Boyd	2fbd43c603	Use core lg models as vectors models in quickstart	2020-10-16 08:17:53 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	ff4267d181	Fix success message [ci skip]	2020-10-15 14:42:08 +02:00
Ines Montani	10611bf56a	Increment version [ci skip]	2020-10-15 13:30:11 +02:00
Ines Montani	4e17ddf75e	Merge pull request #6256 from adrianeboyd/bugfix/docs-to-json-raw	2020-10-15 10:35:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Adriane Boyd	a93d42861d	Use null raw for has_unknown_spaces in docs_to_json	2020-10-15 09:57:54 +02:00
Ines Montani	5665a21517	Tidy up	2020-10-15 09:30:32 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
Ines Montani	bc85b12e6d	Merge pull request #6249 from svlandeg/feature/batch-tests	2020-10-15 08:57:56 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
svlandeg	44e14ccae8	one more losses fix	2020-10-14 15:11:34 +02:00
svlandeg	0aa8851878	always return losses	2020-10-14 15:00:49 +02:00
svlandeg	e94a21638e	adding tests for trained models to ensure predict reproducibility	2020-10-13 21:07:13 +02:00
svlandeg	ede979d42f	formatting	2020-10-13 18:53:17 +02:00
svlandeg	ff83bfae3f	naming	2020-10-13 18:52:37 +02:00
svlandeg	6ccacff54e	add tests for individual spacy layers	2020-10-13 18:50:07 +02:00
svlandeg	c23041ae60	component tests single or multiple prediction	2020-10-13 16:26:53 +02:00
Ines Montani	1f49300862	Update transformer recommendations [ci skip]	2020-10-13 15:41:17 +02:00
Sofie Van Landeghem	f8a1c1afd6	avoid dropout at runtime (#6247 )	2020-10-13 14:39:59 +02:00
Ines Montani	86d648740f	Fix morph representation in Doc.to_json	2020-10-13 11:39:03 +02:00
Ines Montani	7f92a5ee6a	Update spacy/lang/ta/examples.py	2020-10-13 11:03:35 +02:00
Ines Montani	a0e12c136b	Increment version [ci skip]	2020-10-13 10:00:53 +02:00
Ines Montani	f090f39f17	Merge pull request #6245 from svlandeg/bugfix/else bugfix in _pipe	2020-10-13 09:59:06 +02:00
svlandeg	1f465bea18	if-else	2020-10-13 09:27:19 +02:00
svlandeg	40276fd3be	update NEL docs after latest refactor	2020-10-12 11:41:27 +02:00
Ines Montani	4fa967ea84	Increment version [ci skip]	2020-10-11 13:10:58 +02:00
Ines Montani	ab890a35f9	Make console logger table more compact	2020-10-11 12:55:46 +02:00
Ines Montani	99606e46fe	Relax meta.json schema [ci skip]	2020-10-11 12:30:57 +02:00
svlandeg	3a505e7e14	small edit to ensure the new word was indeed new	2020-10-10 21:05:28 +02:00
svlandeg	68d79796c6	add test for vocab after serializing KB	2020-10-10 20:59:48 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	796f8b9424	Increment version	2020-10-09 18:00:27 +02:00
Ines Montani	525f798841	Fix typo in test	2020-10-09 18:00:21 +02:00
Ines Montani	8ac5f22253	Adjust error message	2020-10-09 18:00:16 +02:00
svlandeg	08cb085f6c	Merge remote-tracking branch 'upstream/develop' into fix/various	2020-10-09 17:01:27 +02:00
Ines Montani	b7cb9d95e4	Merge pull request #6229 from svlandeg/bugfix/disabled	2020-10-09 16:05:11 +02:00
svlandeg	e972ecba72	add utf8 encoding for opening file	2020-10-09 16:03:14 +02:00
Ines Montani	9fb3244672	Merge pull request #6231 from adrianeboyd/feature/include-static-vectors	2020-10-09 15:54:52 +02:00
svlandeg	040c7c0541	fix get_dim calls in build_simple_cnn_text_classifier	2020-10-09 15:40:58 +02:00
Adriane Boyd	727370c633	Remove Span._recalculate_indices Remove `Span._recalculate_indices`, which is a remnant from the deprecated `Span.merge`.	2020-10-09 14:42:51 +02:00
svlandeg	853edace37	fix MultiHashEmbed example in documentation	2020-10-09 14:11:06 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
svlandeg	2cafba5f50	shorten error message for clarity	2020-10-09 12:17:35 +02:00
Ines Montani	4771a10503	Make test more explicit [ci skip]	2020-10-09 12:15:26 +02:00
Ines Montani	cc3646b06c	Add xfailing test for peculiar spans failure [ci skip]	2020-10-09 12:10:25 +02:00
svlandeg	8316bc7d4a	bugfix DisabledPipes	2020-10-09 12:06:20 +02:00
svlandeg	18dfb27985	Add custom error when evaluation throws a KeyError	2020-10-09 12:05:33 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Florijan Stamenković	18f5c309dc	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:14:40 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	064575d79d	Merge pull request #6216 from svlandeg/feature/nel-initialize	2020-10-08 11:14:12 +02:00
svlandeg	3e2e1fd323	cleanup	2020-10-08 10:37:32 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
Ines Montani	010956d493	Clear rule-based components on initialize	2020-10-08 09:51:31 +02:00
Baranitharan	d6037c1860	added sentence	2020-10-08 08:22:58 +05:30
Baranitharan	81afe9b19d	Update examples.py	2020-10-08 08:17:25 +05:30
Sofie Van Landeghem	241cd112f5	add reenabled pipe names back to the meta before serializing (#6219 )	2020-10-08 00:44:16 +02:00
Sofie Van Landeghem	2998131416	Reproducibility for TextCat and Tok2Vec (#6218 ) * ensure fixed seed in HashEmbed layers * forgot about the joys of python 2	2020-10-08 00:43:46 +02:00
svlandeg	efedccea8d	fix tests	2020-10-07 15:29:52 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
svlandeg	33c2d4af16	move kb_loader to initialize for NEL instead of constructor	2020-10-07 14:56:00 +02:00
Wannaphong Phatthiyaphaibun	9fc8392b38	Add Thai tag map (LST20 Corpus) (#6163 ) * Add Thai tag map (LST20 Corpus) By @korakot * Update tag_map.py * Update tag_map.py * Update tag_map.py	2020-10-07 11:12:01 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	2ce6fc2611	Turkish tag map and morph rules addition (#6141 ) * feat: added turkish tag map * feat: morph rules cconj and sconj * feat: more conjuncts * feat: added popular postpositions * feat: added adverbs * feat: added personal pronouns * feat: added reflexive pronouns * minor: corrected case capital * minor: fixed comma typo * feat: added indef pronouns * feat: added dict iter * fixed comma typo * updated language class with tag map and morph * use default tag map instead * removed tag map	2020-10-07 10:27:36 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and adds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Nuccy90	c809b2c8e7	Update morph_rules.py (#6102 ) * Update morph_rules.py Added "dig" and "dej" ("you" in accusative form) * Create Nuccy90.md * Update Nuccy90.md	2020-10-06 15:14:47 +02:00
Matthew Honnibal	1a500f9717	Set version to v3.0.0a35	2020-10-06 14:19:07 +02:00
Sofie Van Landeghem	fff3f8ccfa	Fix packaging pin (#6212 ) * pin packaging to >=20.0 * ignore spacy-pkuseg in requirements unit test	2020-10-06 14:16:05 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Florijan Stamenković	9db670b996	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-06 11:17:37 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
svlandeg	9b4cf7b0b6	update output of debug config command	2020-10-06 09:47:23 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	1a554bdcb1	Update docs and docstring [ci skip]	2020-10-05 21:55:27 +02:00
Ines Montani	9614e53b02	Tidy up and auto-format	2020-10-05 21:55:18 +02:00
Ines Montani	181039bd17	Merge pull request #6205 from explosion/feature/embed-features	2020-10-05 21:49:10 +02:00
Ines Montani	5ba418b08c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-10-05 21:44:01 +02:00
Ines Montani	568617af58	Merge pull request #6202 from explosion/feature/project-spacy-version	2020-10-05 21:40:52 +02:00
Ines Montani	2d0c0134bc	Adjust message [ci skip]	2020-10-05 21:38:23 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Matthew Honnibal	b7e01d2024	Fix quickstart	2020-10-05 21:21:30 +02:00
Matthew Honnibal	ff8b980775	Upd quickstart template	2020-10-05 21:19:41 +02:00
Matthew Honnibal	91d0fbb588	Fix test	2020-10-05 21:13:53 +02:00
Ines Montani	9ca283a899	Merge branch 'develop' into feature/project-spacy-version	2020-10-05 21:06:07 +02:00
Ines Montani	0135f6ed95	Enable commit check via env var	2020-10-05 20:51:15 +02:00
Matthew Honnibal	b392d48e76	Fix test	2020-10-05 20:17:07 +02:00
Ines Montani	be99f1e4de	Remove output dirs before training (#6204 ) * Remove output dirs before training * Re-raise error if cleaning fails	2020-10-05 20:11:16 +02:00
Matthew Honnibal	e50047f1c5	Check lengths match	2020-10-05 20:02:45 +02:00
Ines Montani	582701519e	Remove __release__ flag	2020-10-05 20:00:49 +02:00
Ines Montani	d58fb42707	Add spacy_version option and validation for project.yml	2020-10-05 20:00:42 +02:00
Matthew Honnibal	db84d175c3	Fix test	2020-10-05 19:59:30 +02:00
Matthew Honnibal	cdd2b79b6d	Remove deprecated MultiHashEmbed	2020-10-05 19:58:18 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
svlandeg	193e0d5a98	add docs for entity_ruler.initialize	2020-10-05 18:04:08 +02:00
svlandeg	3ac3447eee	cleanup	2020-10-05 17:50:37 +02:00
svlandeg	9eb813a35d	Merge remote-tracking branch 'upstream/develop' into fix/patterns-init	2020-10-05 17:49:44 +02:00
Adriane Boyd	f102ef6b54	Read features.msgpack instead of features.pkl	2020-10-05 17:47:39 +02:00
svlandeg	4e3ace4b8c	is_trainable method	2020-10-05 17:43:42 +02:00
Ines Montani	84fedcebab	Make args keyword-only [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-05 17:07:35 +02:00
Matthew Honnibal	71e73ed0a6	Merge branch 'develop' into feature/embed-features	2020-10-05 17:00:05 +02:00
Matthew Honnibal	3ee3649b52	Fix augment	2020-10-05 16:59:49 +02:00
Matthew Honnibal	22937d25a9	Merge branch 'develop' into feature/embed-features	2020-10-05 16:42:17 +02:00
Matthew Honnibal	8deed614e9	Fix augment	2020-10-05 16:41:45 +02:00
Matthew Honnibal	4ed3e037df	Fix augment	2020-10-05 16:40:55 +02:00
Matthew Honnibal	9f1bc3f24c	Fix augment	2020-10-05 16:40:23 +02:00
svlandeg	dc06912c76	prevent loss keyerror for non-trainable components	2020-10-05 16:33:28 +02:00
Adriane Boyd	187234648c	Revert back to "default" as default for pkuseg_user_dict	2020-10-05 16:24:28 +02:00
svlandeg	65abd77779	add finish_update to Pipe	2020-10-05 16:23:33 +02:00
Matthew Honnibal	90040aacec	Fix merge	2020-10-05 16:12:01 +02:00
Matthew Honnibal	93a98e8c3e	Merge branch 'develop' into feature/embed-features	2020-10-05 15:51:31 +02:00
Matthew Honnibal	eb9ba61517	Format	2020-10-05 15:29:49 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Ines Montani	7946fd84bb	Merge pull request #6200 from adrianeboyd/bugfix/vocab-disk-lookups-vectors Always serialize lookups and vectors to disk	2020-10-05 15:15:25 +02:00
Ines Montani	8171e28b20	Remove logging [ci skip] This would be fired on each example, which is wrong	2020-10-05 15:09:52 +02:00
svlandeg	251b3eb4e5	add initialize method for entity_ruler	2020-10-05 14:59:13 +02:00

8739 Commits