spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	8d7af5b2b1	Ensure hyphen in config file works as string value (#7642) * add test for serializing '-' in a config file * bump srsly to 2.4.1	2021-04-12 14:35:57 +02:00
Sofie Van Landeghem	27dbbb9903	Bugfix/nel crossing sentence (#7630) * ensure each entity gets a KB ID, even when it's not within a sentence * cleanup	2021-04-12 18:08:01 +10:00
Stanislav Schmidt	2516896849	Make vocab update in get_docs deterministic (#7603) * Make vocab update in get_docs deterministic The attribute `DocBin.strings` is a set. In `DocBin.get_docs` a given vocab is updated by iterating over this set. Iteration over a python set produces an arbitrary ordering, therefore vocab is updated non-deterministically. When training (fine-tuning) a spacy model, the base model's vocabulary will be updated with the new vocabulary in the training data in exactly the way described above. After serialization, the file `model/vocab/strings.json` will be sorted in an arbitrary way. This prevents reproducible model training. * Revert "Make vocab update in get_docs deterministic" This reverts commit `d6b87a2f55`. * Sort strings in StringStore serialization Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-09 11:53:13 +02:00
Adriane Boyd	8008e2f75b	Use morph hash in lemmatizer cache key (#7690) Use the morph hash rather than the `MorphAnalysis` object in the cache key so that the `Lemmatizer` can be pickled.	2021-04-08 13:22:38 +02:00
Sofie Van Landeghem	204c2f116b	Extend score_spans for overlapping & non-labeled spans (#7209) * extend span scorer with consider_label and allow_overlap * unit test for spans y2x overlap * add score_spans unit test * docs for new fields in scorer.score_spans * rename to include_label * spell out if-else for clarity * rename to 'labeled' Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-04-08 12:19:17 +02:00
broaddeep	ee159b8543	Support match alignments (#7321) * Support match alignments * change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case * remove added errors, utilize bint type, cleanup whitespace * fix no new line in end of file * Minor formatting * Skip alignments processing if as_spans is set * Add with_alignments to Matcher API docs * Update website/docs/api/matcher.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-04-08 18:10:14 +10:00
graue70	81fd595223	Fix __add__ method of PRFScore (#7557) * Add failing test for PRFScore * Fix erroneous implementation of __add__ * Simplify constructor Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-04-08 17:34:14 +10:00
Adriane Boyd	348d1829c7	Preserve user data for DependencyMatcher on spans (#7528) * Preserve user data for DependencyMatcher on spans * Clean underscore in test * Modify test to use extensions stored in user data	2021-03-30 12:26:22 +02:00
Adriane Boyd	27a48f2802	Fix/update extension copying in Span.as_doc and Doc.from_docs (#7574) * Adjust custom extension data when copying user data in `Span.as_doc()` * Restrict `Doc.from_docs()` to adjusting offsets for custom extension data * Update test to use extension * (Duplicate bug fix for character offset from #7497)	2021-03-30 09:49:12 +02:00
Adriane Boyd	139f655f34	Merge doc.spans in Doc.from_docs() (#7497) Merge data from `doc.spans` in `Doc.from_docs()`. * Fix internal character offset set when merging empty docs (only affects tokens and spans in `user_data` if an empty doc is in the list of docs)	2021-03-29 22:34:01 +11:00
Adriane Boyd	d59f968d08	Keep sent starts without parse in retokenization (#7424) In the retokenizer, only reset sent starts (with `set_children_from_head`) if the doc is parsed. If there is no parse, merged tokens have the unset `token.is_sent_start == None` by default after retokenization.	2021-03-29 22:32:00 +11:00
Adriane Boyd	deffc3a532	Update package requirements tests (#7409) * Add hypothesis to packages skipped in version check * Add numpy back to tests following `2df1ab8a`	2021-03-11 16:24:31 +01:00
Sofie Van Landeghem	932887b950	textcat scoring fix and multi_label docs (#6974) * add multi-label textcat to menu * add infobox on textcat API * add info to v3 migration guide * small edits * further fixes in doc strings * add infobox to textcat architectures * add textcat_multilabel to overview of built-in components * spelling * fix unrelated warn msg * Add textcat_multilabel to quickstart [ci skip] * remove separate documentation page for multilabel_textcategorizer * small edits * positive label clarification * avoid duplicating information in self.cfg and fix textcat.score * fix multilabel textcat too * revert threshold to storage in cfg * revert threshold stuff for multi-textcat Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:04:22 +11:00
Adriane Boyd	3f3e8110dc	Fix lowercase augmentation (#7336) * Fix aborted/skipped augmentation for `spacy.orth_variants.v1` if lowercasing was enabled for an example * Simplify `spacy.orth_variants.v1` for `Example` vs. `GoldParse` * Preserve reference tokenization in `spacy.lower_case.v1`	2021-03-09 14:02:32 +11:00
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
svlandeg	d900c55061	consistently use registry as callable	2021-03-02 17:56:28 +01:00
Sofie Van Landeghem	212f0e779e	Support doc.spans in Example.from_dict (#7197) * add support for spans in Example.from_dict * add unit tests * update error to E879	2021-03-03 01:12:54 +11:00
Ines Montani	635ae55b74	Merge pull request #7237 from adrianeboyd/bugfix/is-cython-func-7224	2021-03-03 00:05:16 +11:00
Adriane Boyd	0efb7413f9	Use make_tempdir instead	2021-03-01 17:54:14 +01:00
Adriane Boyd	e9f7f9a4bc	Fix is_cython_func for additional imported code * Fix `is_cython_func` for imported code loaded under `python_code` module name * Add `make_named_tempfile` context manager to test utils to test loading of imported code * Add test for validation of `initialize` params in custom module	2021-03-01 16:37:39 +01:00
Sofie Van Landeghem	dd99872bb0	Fix spans weak ref in doc copy (#7225) * failing unit test * ensure that doc.spans refers to the copied doc, not the old * add type info	2021-02-28 12:32:48 +11:00
Sofie Van Landeghem	b92f81d5da	fix NEL config and IO, and n_sents functionality (#7100) * fix NEL config and IO, and n_sents functionality * add docs * fix test	2021-02-22 14:49:52 +11:00
Sofie Van Landeghem	113e8d082b	only evaluate named entities for NEL if there is a corresponding gold span (#7074)	2021-02-22 11:06:50 +11:00
Boian Tzonev	cca8651fc8	Bulgarian tokenizer exceptions (#7114) * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian * [Bulgarian] Add tokenizer exceptions and like_num for Bulgarian	2021-02-19 19:19:19 +01:00
Sofie Van Landeghem	709c9e75af	span.ent only returns first sentence (#7084) * return first sentence when span contains sentence boundary * docs fix * small fixes * cleanup	2021-02-19 23:02:38 +11:00
Ines Montani	f4f46b617f	Preserve sourced components in fill-config (fixes #7055) (#7058)	2021-02-14 14:02:14 +11:00
Matthew Honnibal	0fb8d437c0	Fix sentence fragments bug (#7056, #7035) (#7057) * Add test for #7035 * Update test for issue 7056 * Fix test * Fix transitions method used in testing * Fix state eol detection when rebuffer * Clean up redundant fix	2021-02-14 13:38:13 +11:00
Ines Montani	9ba715ed16	Tidy up and auto-format	2021-02-13 12:55:56 +11:00
Ines Montani	34ee0fbd70	Merge pull request #7011 from Shumie82/master	2021-02-13 12:30:42 +11:00
Ines Montani	e583050547	Merge pull request #7039 from svlandeg/debug	2021-02-13 11:53:41 +11:00
Ines Montani	6c450decfc	Fix punctuation settings and add to initialize tests	2021-02-13 11:51:21 +11:00
svlandeg	03b4ec7d7f	fix typo	2021-02-12 14:30:16 +01:00
Adriane Boyd	5e47a54d29	Include noun chunks method when pickling Vocab	2021-02-12 13:27:46 +01:00
svlandeg	278e9eaa14	remove ner	2021-02-11 21:08:04 +01:00
svlandeg	ebeedfc70b	regression test for 7029	2021-02-11 20:56:48 +01:00
Ines Montani	26bf642afd	Fix issue #7019: Handle None scores in evaluate printer (#7026)	2021-02-11 16:45:23 +11:00
Ines Montani	6b9026a219	Merge pull request #7000 from explosion/feature/project-yml-overrides Support env vars and CLI overrides for project.yml	2021-02-11 12:31:45 +11:00
Ines Montani	ad9ce3c8f6	Fix issue #6950: allow pickling Tok2Vec with listeners	2021-02-11 11:37:39 +11:00
Peter Baumann	61b04a70d5	Run PhraseMatcher on Spans (#6918) * Add regression test * Run PhraseMatcher on Spans * Add test for PhraseMatcher on Spans and Docs * Add SCA * Add test with 3 matches in Doc, 1 match in Span * Update docs * Use doc.length for find_matches in tokenizer Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-02-10 23:43:32 +11:00
Ines Montani	21176c69b0	Update and add test	2021-02-10 14:12:00 +11:00
Ines Montani	c08b3f294c	Support env vars and CLI overrides for project.yml	2021-02-10 13:45:27 +11:00
melonwater211	a7977b5143	The test `spacy/tests/vocab_vectors/test_lexeme.py::test_vocab_lexeme_add_flag_auto_id` seems to fail occasionally when the test suite is run in a random order. (#6956) ```python def test_vocab_lexeme_add_flag_auto_id(en_vocab): is_len4 = en_vocab.add_flag(lambda string: len(string) == 4) assert en_vocab["1999"].check_flag(is_len4) is True assert en_vocab["1999"].check_flag(IS_DIGIT) is True assert en_vocab["199"].check_flag(is_len4) is False > assert en_vocab["199"].check_flag(IS_DIGIT) is True E assert False is True E + where False = <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840>(3) E + where <built-in method check_flag of spacy.lexeme.Lexeme object at 0x7fa155c36840> = <spacy.lexeme.Lexeme object at 0x7fa155c36840>.check_flag spacy/tests/vocab_vectors/test_lexeme.py:49: AssertionError ``` Environment: `pytest==6.1.1`, `numpy==1.19.2`, `Python version: 3.8.3`. To reproduce the error, run `pytest --random-order-bucket=global --random-order-seed=170158 -v spacy/tests` If `test_vocab_lexeme_add_flag_auto_id` is run after `test_vocab_lexeme_add_flag_provided_id`, it fails. It seems like `test_vocab_lexeme_add_flag_provided_id` uses the `IS_DIGIT` bit for testing purposes but does not reset the bit. This solution seems to work but, if anyone has a better fix, please let me know and I will integrate it.	2021-02-07 07:51:34 +08:00
René Octavio Queiroz Dias	999ff03b19	fix: Fix textcat labels to expect an Optional[Iterable[str]] instead of Optional[Dict] (#6911) * docs: Add agreement * bug: Regression test Issue #6908 * fix: Changed from Dict to Iterable[str] Fix #6908 * Update test to use make_tempdir * fix: Fix WindowsPath error Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-02-04 23:37:13 +01:00
Ines Montani	e6accb3a9e	Tidy up and auto-format	2021-01-30 12:52:33 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Ines Montani	bc089b693c	Update tests	2021-01-29 19:38:09 +11:00
Ines Montani	01ecfbcc45	Merge branch 'develop' into feature/replace-listeners	2021-01-29 15:57:32 +11:00
Ines Montani	911dfcccfc	Add option to replace listeners for sourced components	2021-01-29 15:57:04 +11:00
Adriane Boyd	fcce3600ed	Forbid OP matching 2+ tokens in DependencyMatcher (#6824) Instead of silently using only the first token in each matched span: * Forbid `OP: ?//+` through `DependencyMatcher` validation As a fail-safe, add warning if a token match that's not exactly one token long is found by a token pattern.	2021-01-29 08:52:01 +08:00
Sofie Van Landeghem	24a697abb8	avoid empty aliases and improve UX and docs (#6840)	2021-01-29 08:51:40 +08:00
Sofie Van Landeghem	837a4f53c2	Error handling in nlp.pipe (#6817) * add error handler for pipe methods * add unit tests * remove pipe method that are the same as their base class * have Language keep track of a default error handler * cleanup * formatting * small refactor * add documentation	2021-01-29 08:51:21 +08:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
Sofie Van Landeghem	6b68ad027b	Fix beam NER resizing (#6834) * move label check to sub methods * add tests	2021-01-27 23:39:14 +11:00
Ines Montani	5ed51c9dd2	Merge pull request #6828 from explosion/master-tmp	2021-01-27 23:05:46 +11:00
Adriane Boyd	d17afb4826	Add Spanish rule-based lemmatizer (#6833) * Initial Spanish lemmatizer * Handle merged verb+pron(s) multi-word tokens * Use VERB for AUX rule lookup * Add morph to lemma cache key * Fix aux lookups, minor refactoring * Improve verb+pron handling * Move verb+pron handling into its own method * Check for exceptions (primarily for se) * Collect pronouns in the same (not reversed) order * Only add modified possible lemmas	2021-01-27 19:21:35 +08:00
Ines Montani	80ba9eaf7d	Fix test	2021-01-27 21:29:02 +11:00
Ines Montani	230e651ad6	Merge branch 'develop' into master-tmp	2021-01-27 13:26:29 +11:00
Matthew Honnibal	68b1c2984d	Test labels are added implicitly	2021-01-27 12:52:29 +11:00
Dhruv Naik	e7db07a0b9	Fix Span.char_span bug (#6816) * Create dhruvrnaik.md * add test for issue #6815 * bugfix for issue #6815 * update dhruvrnaik.md * add span.vector test for #6815	2021-01-26 15:50:37 +08:00
Adriane Boyd	2263bc7b28	Update develop from master for v3.0.0rc5 (#6811) * Fix `spacy.util.minibatch` when the size iterator is finished (#6745) * Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches. * support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead * ensure span.text works for an empty span (#6772) * Remove unicode_literals Co-authored-by: Santiago Castro <bryant@montevideo.com.uy> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-01-26 14:52:45 +11:00
Matthew Honnibal	f049df1715	Revert "Set annotations in update" (#6810) * Revert "Set annotations in update (#6767)" This reverts commit `e680efc7cc`. * Fix version * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/multilabel_textcategorizer.md * Update website/docs/api/tok2vec.md * Update website/docs/usage/layers-architectures.md * Update website/docs/usage/layers-architectures.md * Update website/docs/api/transformer.md * Update website/docs/api/textcategorizer.md * Update website/docs/api/tagger.md * Update spacy/pipeline/entity_linker.py * Update website/docs/api/sentencerecognizer.md * Update website/docs/api/pipe.md * Update website/docs/api/morphologizer.md * Update website/docs/api/entityrecognizer.md * Update spacy/pipeline/entity_linker.py * Update spacy/pipeline/multitask.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/tagger.pyx * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/textcat.py * Update spacy/pipeline/tok2vec.py * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/trainable_pipe.pyx * Update spacy/pipeline/transition_parser.pyx * Update spacy/pipeline/transition_parser.pyx * Update website/docs/api/entitylinker.md * Update website/docs/api/dependencyparser.md * Update spacy/pipeline/trainable_pipe.pyx	2021-01-25 22:18:45 +08:00
muratjumashev	87168eb81f	Add tests	2021-01-24 20:56:16 +06:00
Sofie Van Landeghem	5ace559201	ensure span.text works for an empty span (#6772)	2021-01-21 23:18:46 +08:00
Sofie Van Landeghem	d93cd3b7c0	remove artificially duplicated test [ci skip]	2021-01-21 10:53:16 +01:00
Sofie Van Landeghem	fdf8c77630	support IS_SENT_START in PhraseMatcher (#6771) * support IS_SENT_START in PhraseMatcher * add unit test and friendlier error * use IDS.get instead	2021-01-21 09:59:17 +01:00
Sofie Van Landeghem	e680efc7cc	Set annotations in update (#6767) * bump to 3.0.0rc4 * do set_annotations in component update calls * update docs and remove set_annotations flag * fix EL test	2021-01-20 11:49:25 +11:00
Adriane Boyd	bc7d83d4be	Skip 0-length matches (#6759) Add hack to prevent matcher from returning 0-length matches.	2021-01-19 07:38:11 +08:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Adriane Boyd	bf0cdae8d4	Add token_splitter component (#6726) * Add long_token_splitter component Add a `long_token_splitter` component for use with transformer pipelines. This component splits up long tokens like URLs into smaller tokens. This is particularly relevant for pretrained pipelines with `strided_spans`, since the user can't change the length of the span `window` and may not wish to preprocess the input texts. The `long_token_splitter` splits tokens that are at least `long_token_length` tokens long into smaller tokens of `split_length` size. Notes: * Since this is intended for use as the first component in a pipeline, the token splitter does not try to preserve any token annotation. * API docs to come when the API is stable. * Adjust API, add test * Fix name in factory	2021-01-17 19:54:41 +08:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Ines Montani	f9e4ac1283	Fix test	2021-01-15 12:51:02 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	57369909c0	Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3	2021-01-15 11:44:28 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
Ines Montani	31a92b28ae	Merge pull request #6715 from adrianeboyd/feature/before-after-init-callbacks Add initialize.before_init and after_init callbacks	2021-01-13 12:17:00 +11:00
Ines Montani	97d5a7ba99	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-13 12:03:02 +11:00
Ines Montani	8d6448ccf7	Add config resolver test	2021-01-13 12:02:59 +11:00
svlandeg	232e953b14	pytest.approx with absolute eps	2021-01-12 20:32:57 +01:00
svlandeg	5b598bd1d5	formatting	2021-01-12 17:28:41 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	a45d89f09a	Add initialize.before_init and after_init callbacks Add `initialize.before_init` and `initialize.after_init` callbacks to the config. The `initialize.before_init` callback is a place to implement one-time tokenizer customizations that are then saved with the model.	2021-01-12 13:07:44 +01:00
Adriane Boyd	ad43cbb042	Sync missing and misaligned values in Tagger loss (#6689) Use `None` for both missing and misaligned annotation in `Tagger.get_loss`, reverting to the default missing value in the loss function.	2021-01-10 11:30:37 +11:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	8c1a23209f	Getting scores out of beam_parser (#6684) * clean up of ner tests * beam_parser tests * implement get_beam_parses and scored_parses for the dep parser * we don't have to add the parse if there are no arcs	2021-01-07 16:28:27 +11:00
Sofie Van Landeghem	402dbc5bae	Getting scores out of beam_ner (#6575) * small fixes and formatting * bring test_issue4313 up-to-date, currently fails * formatting * add get_beam_parses method back * add scored_ents function * delete tag map	2021-01-06 12:02:32 +01:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636)	2021-01-06 12:30:30 +08:00
Adriane Boyd	bf9096437e	Set default lemmas in retokenizer (#6667) Instead of unsetting lemmas on retokenized tokens, set the default lemmas to: * merge: concatenate any existing lemmas with `SPACY` preserved * split: use the new `ORTH` values if lemmas were previously set, otherwise leave unset	2021-01-06 12:29:44 +08:00
Adriane Boyd	0041dfbc7f	Use special matcher for exceptions with spaces (#6668) Use the special cases phrase matcher for exceptions that include space characters so that exceptions including spaces are supported.	2021-01-06 12:05:10 +08:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	224a3590e9	Merge pull request #6654 from svlandeg/chore/tests-cleanup Unskipping tests	2021-01-05 13:53:40 +11:00
Ines Montani	c4993f16d0	Merge pull request #6651 from svlandeg/bugfix/cli_info	2021-01-05 13:44:26 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adriane Boyd	b57be94c78	Fix memory issues in Language.evaluate (#6386) * Fix memory issues in Language.evaluate Reset annotation in predicted docs before evaluating and store all data in `examples`. * Minor refactor to docs generator init * Fix generator expression * Fix final generator check * Refactor pipeline loop * Handle examples generator in Language.evaluate * Add test with generator * Use make_doc	2020-12-31 10:45:50 +11:00
svlandeg	a6a68da673	unskipping tests with python >= 3.6	2020-12-30 18:46:43 +01:00
svlandeg	d5ff0fecf8	add docs	2020-12-30 14:01:13 +01:00
svlandeg	c74ab6a313	fix imports	2020-12-30 12:40:12 +01:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Adriane Boyd	cabd4ae5b1	Use logger.warning instead of logger.warn (#6596) Use `logger.warning` instead of deprecated `logger.warn`.	2020-12-21 08:25:10 +08:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	0a923a7915	Tagger robustness (#6580) * require labels in taggers * ensure tagger works with incomplete data	2020-12-18 18:51:47 +08:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Ines Montani	f2571b5ec4	Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master	2020-12-09 13:09:58 +11:00
Ines Montani	90171f2031	Merge pull request #6528 from svlandeg/feature/pipe_fill_config	2020-12-09 12:01:22 +11:00
Ines Montani	dfaef27f90	Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos Warn on empty POS for the rule-based lemmatizer	2020-12-09 11:34:16 +11:00
Ines Montani	b85bd63eca	Fix test	2020-12-09 11:24:01 +11:00
Ines Montani	febf71af28	Fix test	2020-12-09 11:23:07 +11:00
Ines Montani	1980203229	Merge branch 'master' into pr/6444	2020-12-09 11:09:40 +11:00
Ines Montani	05a2812ae0	Merge branch 'develop' into pr/6444	2020-12-09 11:04:03 +11:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on an individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_spans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbrevs to another list * fix typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
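
The two points made in commit 4448680750 (alignment for 1-to-1 tokens and lowercasing) can be illustrated with a small standalone sketch. This is not spaCy's implementation: `char_to_token` and `one_to_one` are hypothetical helper names, and the sketch assumes both tokenizations cover the same text once whitespace is ignored and the tokens are lowercased.

```python
from typing import List, Tuple


def char_to_token(tokens: List[str]) -> List[Tuple[int, bool]]:
    """For every character of the lowercased tokens, record
    (token index, whether this character starts the token)."""
    mapping = []
    for i, token in enumerate(tokens):
        # Lowercase first: lowercasing can change the string length.
        for j, _ in enumerate(token.lower()):
            mapping.append((i, j == 0))
    return mapping


def one_to_one(tokens_a: List[str], tokens_b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of tokens that align one-to-one.

    Hypothetical helper: a pair is only accepted if the token texts match
    (ignoring case) AND the shared character position is the start of a
    token on both sides.
    """
    map_a = char_to_token(tokens_a)
    map_b = char_to_token(tokens_b)
    if len(map_a) != len(map_b):
        raise ValueError("tokenizations do not cover the same characters")
    pairs = []
    for (i, start_a), (j, start_b) in zip(map_a, map_b):
        same_text = tokens_a[i].lower() == tokens_b[j].lower()
        if start_a and start_b and same_text:
            pairs.append((i, j))
    return pairs


# Lowercasing "İ" (U+0130) yields "i" plus a combining dot above.
dotted_i_length = len("İ".lower())

# The middle tokens are identical, but they start at different character
# positions, so they are not aligned on the token level.
misaligned = one_to_one(["a'", "''"], ["a", "''", "'"])

# Tokens that differ only in capitalization still align.
case_only = one_to_one(["A", "b"], ["a", "B"])
```

Comparing token texts alone would pair the two `''` tokens in the first example; the start-of-token check on both sides is what rules that pair out.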

2292 Commits