spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-14 14:56:02 +03:00

Author	SHA1	Message	Date
Adriane Boyd	f99d6d5e39	Refactor scoring methods to use registered functions (#8766 ) * Add scorer option to components Add an optional `scorer` parameter to all pipeline components. If a scoring function is provided, it overrides the default scoring method for that component. * Add registered scorers for all components * Add `scorers` registry * Move all scoring methods outside of components as independent functions and register * Use the registered scoring methods as defaults in configs and inits Additional: * The scoring methods no longer have access to the full component, so use settings from `cfg` as default scorer options to handle settings such as `labels`, `threshold`, and `positive_label` * The `attribute_ruler` scoring method no longer has access to the patterns, so all scoring methods are called * Bug fix: `spancat` scoring method is updated to set `allow_overlap` to score overlapping spans correctly * Update Russian lemmatizer to use direct score method * Check type of cfg in Pipe.score * Fix check * Update spacy/pipeline/sentencizer.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove validate_examples from scoring functions * Use Pipe.labels instead of Pipe.cfg["labels"] Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-08-10 15:13:39 +02:00
fgaim	ee011ca963	Update Tigrinya ትግርኛ language support (#8900 ) * Add missing punctuation for Tigrinya and Amharic * Fix numeral and ordinal numbers for Tigrinya - Amharic was used in many cases - Also fixed some typos * Update Tigrinya stop-words * Contributor agreement for fgaim * Fix typo in "ti" lang test * Remove multi-word entries from numbers and ordinals	2021-08-10 13:55:08 +02:00
Dimitar Ganev	733ffe439d	Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862 ) * Add more stop words and Improve the readability * Add and categorize the tokenizer exceptions for `bg` lang * Create syrull.md * Add references for the additional stop words and tokenizer exc abbrs	2021-08-10 13:44:23 +02:00
Adriane Boyd	a79888ed67	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1	2021-08-09 13:13:13 +02:00
Eduard Zorita	439f30faad	Add stub files for main cython classes (#8427 ) * Add stub files for main API classes * Add contributor agreement for ezorita * Update types for ndarray and hash() * Fix __getitem__ and __iter__ * Add attributes of Doc and Token classes * Overload type hints for Span.__getitem__ * Fix type hint overload for Span.__getitem__ Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>	2021-08-07 12:30:03 +02:00
github-actions[bot]	56d4d87aeb	Auto-format code with black (#8895 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-06 13:38:06 +02:00
Kabir Khan	1dfffe5fb4	No output info message in train (#8885 ) * Add info message that no output directory was provided in train * Update train.py * Fix logging	2021-08-05 09:21:22 +02:00
Adriane Boyd	fa2e7a4bbf	Fix spancat tests on GPU (#8872 ) * Fix spancat tests on GPU * Fix more spancat tests	2021-08-04 14:29:43 +02:00
Paul O'Leary McCann	77d698dcae	Fix check for RIGHT_ATTRS in dep matcher (#8807 ) * Fix check for RIGHT_ATTRs in dep matcher If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and RIGHT_ID. It specifically does not say RIGHT_ATTRS is required. A blank RIGHT_ATTRS is also valid, and patterns with one will be excepted. While not normal, sometimes a REL_OP is enough to specify a non-anchor node - maybe you just want the head of another node unconditionally, for example. This change just sets RIGHT_ATTRS to {} if not present. Alternatively changing E100 to state RIGHT_ATTRS is required could also be reasonable. * Fix test This test was written on the assumption that if `RIGHT_ATTRS` isn't present an error will be raised. Since the proposed changes make it so an error won't be raised this is no longer necessary. * Revert test, update error message Error message now lists missing keys, and RIGHT_ATTRS is required. * Use list of required keys in error message Also removes unused key param arg.	2021-08-04 09:20:41 +02:00
Adriane Boyd	941a591f3c	Pass excludes when serializing vocab (#8824 ) * Pass excludes when serializing vocab Additional minor bug fix: * Deserialize vocab in `EntityLinker.from_disk` * Add test for excluding strings on load * Fix formatting	2021-08-03 14:42:44 +02:00
Adriane Boyd	175847f92c	Support list values and INTERSECTS in Matcher (#8784 ) * Support list values and IS_INTERSECT in Matcher * Support list values as token attributes for set operators, not just as pattern values. * Add `IS_INTERSECT` operator. * Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs. * Rename IS_INTERSECT to INTERSECTS	2021-08-02 19:39:26 +02:00
Adriane Boyd	fbbbda1954	Fix start/end chars for empty and out-of-bounds spans (#8816 )	2021-08-02 19:07:19 +02:00
Adriane Boyd	9ad3b8cf8d	Only add sourced vectors hashes to meta if necessary (#8830 )	2021-08-02 18:22:35 +02:00
Nick Sorros	0485cdefcc	Add logger debug for project push and pull (#8860 ) * Add logger debug for project push and pull * Sign contributor agreement	2021-08-02 18:13:53 +02:00
themrmax	de076194c4	Make ConsoleLogger flush after each logging line (#8810 ) This is necessary to avoid "logging blackouts" when running training on Kubernetes pods	2021-08-02 14:33:38 +02:00
Adriane Boyd	81d3a1edb1	Use tokenizer URL_MATCH pattern in LIKE_URL (#8765 )	2021-07-27 12:07:01 +02:00
Ines Montani	7f21c7dfa2	Merge pull request #8794 from explosion/autoblack Auto-format code with black	2021-07-27 12:17:15 +10:00
Paul O'Leary McCann	284b530c63	Respect the no_skip value Seems like the logic for this was just left out. See #8796.	2021-07-24 15:31:17 +09:00
explosion-bot	a58ab6ea22	Auto-format code with black	2021-07-23 08:04:09 +00:00
Adriane Boyd	6bbc2b1956	Reload train corpus in debug data after initialize (#8776 )	2021-07-21 22:38:40 +02:00
Adriane Boyd	d48c01a6f7	Remove extraneous grc test file (#8768 )	2021-07-20 15:51:15 +02:00
Sofie Van Landeghem	ffaead8fe0	bump to 3.1.1	2021-07-19 14:48:27 +02:00
Sofie Van Landeghem	83e27d262e	negative tag annotation (#8731 ) * unit test to unlearn tag via negative annotation * bump thinc to 8.0.8	2021-07-19 14:39:11 +02:00
Adriane Boyd	0e4b96c97e	Update lexeme ranks for loaded vectors (#8640 ) Update the ranks for any lexemes that have been added to the vocab before the vectors are added to the model.	2021-07-19 18:25:54 +10:00
Adriane Boyd	e532c69475	Update Language.replace_pipe for disabled components (#8729 ) * Fix the index where the replacement in inserted to account for disabled components * Allow `Language.replace_pipe` to replace disabled components	2021-07-19 18:06:12 +10:00
Ines Montani	f90482d077	Tidy up and auto-format	2021-07-18 15:44:56 +10:00
Ines Montani	c0f436efbc	Merge pull request #8735 from explosion/autoblack	2021-07-17 13:46:17 +10:00
Ines Montani	483f3175cb	Tidy up [ci skip]	2021-07-17 13:43:15 +10:00
Ines Montani	15e6578f7d	Adjust formatting	2021-07-17 10:49:13 +10:00
explosion-bot	eff3d1088b	Auto-format code with black	2021-07-16 08:03:36 +00:00
Adriane Boyd	ac45c7c045	Add pre-commit to ignored requirements (#8728 )	2021-07-15 16:41:15 +02:00
jmyerston	993b0fab0e	Added ancient Greek language support (#8606 ) * Add ancient Greek language support Initial commit * Contributor Agreement * grc tokenizer test added and files formatted with black, unnecessary import removed Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Commas in lists fixed. __init__py added to test * Update lex_attrs.py * Update stop_words.py * Update stop_words.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-15 10:27:17 +02:00
Sofie Van Landeghem	77859beb99	spacy.ngram_range_suggester.v1 (#8699 )	2021-07-15 10:01:22 +02:00
Julien Rossi	e117573822	Adding noun_chunks to the DUTCH language model (nl) (#8529 ) * ✨ implement noun_chunks for dutch language * copy/paste FR and SV syntax iterators to accomodate UD tags * added tests with dutch text * signed contributor agreement * 🐛 fix noun chunks generator * built from scratch * define noun chunk as a single Noun-Phrase * includes some corner cases debugging (incorrect POS tagging) * test with provided annotated sample (POS, DEP) * ✅ fix failing test * CI pipeline did not like the added sample file * add the sample as a pytest fixture * Update spacy/lang/nl/syntax_iterators.py * Update spacy/lang/nl/syntax_iterators.py Code readability Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/lang/nl/test_noun_chunks.py correct comment Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * finalize code * change "if next_word" into "if next_word is not None" Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-07-14 14:01:02 +02:00
Adriane Boyd	f9fd2889b7	Use 0-vector for OOV lexemes (#8639 )	2021-07-13 14:48:12 +10:00
Edward	8233359225	Fix preservation of spacy package meta (#8663 ) * update package meta with existing_meta and nlp_meta * Add spaCy contributor agreement * Added more info when creating readme	2021-07-12 11:18:52 +02:00
Adriane Boyd	d8805a1073	Fix ru/uk lemmatizer mp with spawn (#8657 ) Use an instance variable instead a class variable for the morphological analzyer so that multiprocessing with spawn is possible.	2021-07-09 15:36:56 +02:00
Adriane Boyd	b8e720fdb9	Fix Azerbaijani init, extend lang init tests (#8656 ) * Extend langs in initialize tests * Fix az init	2021-07-09 15:36:35 +02:00
explosion-bot	334f1f98d8	Auto-format code with black	2021-07-09 08:06:06 +00:00
Sofie Van Landeghem	64fac754fe	add spacy prefix to ngram_suggester.v1 (#8623 )	2021-07-07 08:09:30 +02:00
Sofie Van Landeghem	733e8ceea9	fix spancat initialize with labels (#8620 )	2021-07-06 19:08:25 +02:00
Sofie Van Landeghem	608fc1d623	avoid msg var impliciteness (#8619 ) * avoid msg var impliciteness * rename local msg * Add CI tests for debug data and train * Adjust debug data CLI test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-06 19:08:08 +02:00
Sofie Van Landeghem	e7d747e3ee	TransitionBasedParser.v1 to legacy (#8586 ) * TransitionBasedParser.v1 to legacy * register sublayers * bump spacy-legacy to 3.0.7	2021-07-06 15:26:45 +02:00
Luca Dorigo	e8ef4a46d5	Add the right return type for Language.pipe and an overload for the as_tuples case (#8441 ) * Add the right return type for Language.pipe and an overload for the as_tuples version * Reformat, tidy up Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-07-06 14:18:40 +02:00
Sofie Van Landeghem	b9f59118bf	Fix silent evaluation (#8581 ) * fix silentness * sneak in docs typo fix * pass silent boolean instead	2021-07-06 14:16:19 +02:00
Sofie Van Landeghem	3daf57d70c	Small spancat fixes (#8614 ) * two small fixes + additional tests * rename	2021-07-06 14:15:41 +02:00
Ines Montani	327f83573a	Move scores per type handling into util function (#8590 )	2021-07-06 13:02:37 +02:00
Adriane Boyd	5fd0b5207e	Fix vectors check for sourced components (#8559 ) * Fix vectors check for sourced components Since vectors are not loaded when components are sourced, store a hash for the vectors of each sourced component and compare it to the loaded vectors after the vectors are loaded from the `[initialize]` block. * Pop temporary info * Remove stored hash in remove_pipe * Add default for pop * Add additional convert/debug/assemble CLI tests	2021-07-06 12:43:17 +02:00
Adriane Boyd	29906884c5	Raise an error for textcat with <2 labels (#8584 ) * Raise an error for textcat with <2 labels Raise an error if initializing a `textcat` component without at least two labels. * Add similar note to docs * Update positive_label description in API docs	2021-07-06 12:35:22 +02:00
explosion-bot	ee37288a1f	Auto-format code with black	2021-07-02 07:48:26 +00:00

1 2 3 4 5 ...

8758 Commits