spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-27 00:19:48 +03:00

Author	SHA1	Message	Date
Lj Miranda	7d50804644	Migrate regression tests into the main test suite (#9655 ) * Migrate regressions 1-1000 * Move serialize test to correct file * Remove tests that won't work in v3 * Migrate regressions 1000-1500 Removed regression test 1250 because v3 doesn't support the old LEX scheme anymore. * Add missing imports in serializer tests * Migrate tests 1500-2000 * Migrate regressions from 2000-2500 * Migrate regressions from 2501-3000 * Migrate regressions from 3000-3501 * Migrate regressions from 3501-4000 * Migrate regressions from 4001-4500 * Migrate regressions from 4501-5000 * Migrate regressions from 5001-5501 * Migrate regressions from 5501 to 7000 * Migrate regressions from 7001 to 8000 * Migrate remaining regression tests * Fixing missing imports * Update docs with new system [ci skip] * Update CONTRIBUTING.md - Fix formatting - Update wording * Remove lemmatizer tests in el lang * Move a few tests into the general tokenizer * Separate Doc and DocBin tests	2021-12-04 20:34:48 +01:00
Daniël de Kok	72f7f4e68a	morphologizer: avoid recreating label tuple for each token (#9764 ) * morphologizer: avoid recreating label tuple for each token The `labels` property converts the dictionary key set to a tuple. This property was used for every annotated token, recreating the tuple over and over again. Construct the tuple once in the set_annotations function and reuse it. On a Finnish pipeline that I was experimenting with, this results in a speedup of ~15% (~13000 -> ~15000 WPS). * tagger: avoid recreating label tuple for each token	2021-11-30 11:58:59 +01:00
Adriane Boyd	c19f0c1604	Switch to latest CI images (#9773 )	2021-11-30 10:08:51 +01:00
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Adriane Boyd	6763cbfdc0	Update Catalan acknowledgements for v3.2 (#9763 )	2021-11-29 14:14:21 +01:00
Paul O'Leary McCann	ac05de2c6c	Fix Language-specific factory handling in package command (#9674 ) * Use internal names for factories If a component factory is registered like `@French.factory(...)` instead of `@Language.factory(...)`, the name in the factories registry will be prefixed with the language code. However in the nlp.config object the factory will be listed without the language code. The `add_pipe` code has fallback logic to handle this, but packaging code and the registry itself don't. This change makes it so that the factory name in nlp.config is the language-specific form. It's not clear if this will break anything else, but it does seem to fix the inconsistency and resolve the specific user issue that brought this to our attention. * Change approach to use fallback in package lookup This adds fallback logic to the package lookup, so it doesn't have to touch the way the config is built. It seems to fix the tests too. * Remove unnecessary line * Add test This also adds an assert that seems to have been forgotten.	2021-11-29 08:31:02 +01:00
Richard Hudson	7b134b8fbd	New tests for a number of alpha languages (#9703 ) * Added Slovak * Added Slovenian tests * Added Estonian tests * Added Croatian tests * Added Latvian tests * Added Icelandic tests * Added Afrikaans tests * Added language-independent tests * Added Kannada tests * Tidied up * Added Albanian tests * Formatted with black * Added failing tests for anomalies * Update spacy/tests/lang/af/test_text.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Estonian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Croatian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Icelandic tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Latvian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovak tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovenian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-28 21:59:23 +01:00
Tuomo Hiippala	5c44533263	add entry for Applied Language Technology under "Courses" (#9755 ) Added the following entry into `universe.json`: ``` { "type": "education", "id": "applt-course", "title": "Applied Language Technology", "slogan": "NLP for newcomers using spaCy and Stanza", "description": "These learning materials provide an introduction to applied language technology for audiences who are unfamiliar with language technology and programming. The learning materials assume no previous knowledge of the Python programming language.", "url": "https://applied-language-technology.readthedocs.io/", "image": "https://www.mv.helsinki.fi/home/thiippal/images/applt-preview.jpg", "thumb": "https://applied-language-technology.readthedocs.io/en/latest/_static/logo.png", "author": "Tuomo Hiippala", "author_links": { "twitter": "tuomo_h", "github": "thiippal", "website": "https://www.mv.helsinki.fi/home/thiippal/" }, "category": ["courses"] }, ```	2021-11-28 19:33:16 +09:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Richard Hudson	7fec5fd647	Merge pull request #9737 from Pantalaymon/patch-1 Create Pantalaymon.md	2021-11-24 09:56:43 +01:00
Valentin-Gabriel Soumah	0bbf86bba8	Create Pantalaymon.md Submitting agreement to spacy in order to contribute to Coreferee project.	2021-11-23 17:29:23 +01:00
Duygu Altinok	a7d7e80adb	EntityRuler improve disk load error message (#9658 ) * added error string * added serialization test * added more to if statements * wrote file to tempdir * added tempdir * changed parameter a bit * Update spacy/tests/pipeline/test_entity_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-23 16:26:05 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Adriane Boyd	a77f50baa4	Allow Scorer.score_spans to handle pred docs with missing annotation (#9701 ) If the predicted docs are missing annotation according to `has_annotation`, treat the docs as having no predictions rather than raising errors when the annotation is missing. The motivation for this is a combined tokenization+sents scorer for a component where the sents annotation is optional. To provide a single scorer in the component factory, it needs to be possible for the scorer to continue despite missing sents annotation in the case where the component is not annotating sents.	2021-11-23 15:17:19 +01:00
Adriane Boyd	36c7047946	Use reference parse to initialize parser moves (#9722 )	2021-11-23 14:55:55 +01:00
Paul O'Leary McCann	52b8c2d2e0	Add note on batch contract for listeners (#9691 ) * Add note on batch contract Using listeners requires batches to be consistent. This is obvious if you understand how the listener works, but it wasn't clearly stated in the Docs, and was subtle enough that the EntityLinker missed it. There is probably a clearer way to explain what the actual requirement is, but I figure this is a good start. * Rewrite to clarify role of caching	2021-11-22 11:06:07 +01:00
Sofie Van Landeghem	13645dcbf5	add note that annotating components is new since 3.1 (#9678 )	2021-11-22 14:43:11 +09:00
Adriane Boyd	0e93b315f3	Convert labels to strings for README in package CLI (#9694 )	2021-11-19 08:51:46 +01:00
Adriane Boyd	ea450d652c	Exclude strings from v3.2+ source vector checks (#9697 ) Exclude strings from `Vector.to_bytes()` comparisons for v3.2+ `Vectors` that now include the string store so that the source vector comparison is only comparing the vectors and not the strings.	2021-11-19 08:51:19 +01:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Vishnu Nandakumar	86fa37e8ba	Update universe.json with new library eng_spacysentiment (#9679 ) * Update universe.json * Update universe.json * Cleanup fields Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-11-16 14:06:19 +09:00
Adriane Boyd	c9baf9d196	Fix spancat for empty docs and zero suggestions (#9654 ) * Fix spancat for empty docs and zero suggestions * Use ops.xp.zeros in test	2021-11-15 12:40:55 +01:00
github-actions[bot]	67d8c8a081	Auto-format code with black (#9664 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-11-12 10:00:03 +01:00
Sofie Van Landeghem	24cdd4c88e	Merge pull request #9638 from polm/fix/optional-pretrain-path Make Jsonl Corpus reader path optional again	2021-11-09 10:45:14 +01:00
Paul O'Leary McCann	8aa2d32ca9	Update jsonlcorpus constructor types	2021-11-09 16:20:19 +09:00
Paul O'Leary McCann	71fb00ed95	Update spacy/training/corpus.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-08 10:02:29 +00:00
Sofie Van Landeghem	c97f29c593	Merge pull request #9629 from ljvmiranda921/chore/migrate-regressions Migrate regression and other tests to the new pytest marker	2021-11-08 09:07:38 +01:00
Paul O'Leary McCann	141f12b92e	Make Jsonl Corpus reader optional again	2021-11-07 18:56:23 +09:00
Lj Miranda	909177589d	Remove utility script	2021-11-06 06:35:58 +08:00
Ines Montani	86af0234ab	Update version [ci skip]	2021-11-05 19:02:35 +01:00
Adriane Boyd	216ed231a9	What's new in v3.2 (#9633 ) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani <ines@ines.io>	2021-11-05 16:31:14 +01:00
Adriane Boyd	0fc3dee772	Merge pull request #9596 from adrianeboyd/tests/reenable-v3.2.0-tests Reenable tests for v3.2.0	2021-11-05 10:54:30 +01:00
github-actions[bot]	5cdb7eb5c2	Auto-format code with black (#9631 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-05 09:58:36 +01:00
Adriane Boyd	e6f91b6f27	Format (#9630 )	2021-11-05 09:56:26 +01:00
Lj Miranda	8e7deaf210	Add missing imports in some regression tests - test_issue7001-8000.py - test_issue8190.py	2021-11-05 11:47:59 +08:00
Lj Miranda	addeb34bc4	Decorate regression tests Even if the issue number is already in the file, I still decorated them just to follow the convention found in test_issue8168.py	2021-11-05 11:47:44 +08:00
Lj Miranda	91dec2c76e	Decorate non-regression tests	2021-11-05 11:47:33 +08:00
Lj Miranda	199943deb4	Add simple script to add pytest marks	2021-11-05 11:47:28 +08:00
Duygu Altinok	f0e8c9fe58	Spanish noun chunks review (#9537 ) * updated syntax iters * formatted the code * added prepositional objects * code clean up * eliminated left attached adp * added es vocab * added basic tests * fixed typo * fixed typo * list to set * fixed doc name * added code for conj * more tests * differentiated adjectives and flat * fixed typo * added compounds * more compounds * tests for compounds * tests for nominal modifiers * fixed typo * fixed typo * formatted file * reformatted tests * fixed typo * fixed punct typo * formatted after changes * added indirect object * added full sentence examples * added longer full sentence examples * fixed sentence length of test * added passive subj * added test case by Damian	2021-11-05 00:46:36 +01:00
Duygu Altinok	6e6650307d	Portuguese noun chunks review (#9559 ) * added tests * added pt vocab * transferred spanish * added syntax iters * fixed parenthesis * added nmod example * added relative pron * fixed rel pron * added rel subclause * corrected typo * added more NP chains * long sentence * fixed typo * fixed typo * fixed typo * corrected heads * added passive subj * added pass subj * added passive obj * refinement to rights * went back to old * fixed test * fixed typo * fixed typo * formatted * Format * Format test cases Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-04 23:55:49 +01:00
Adriane Boyd	2bf52c44b1	Merge pull request #9612 from adrianeboyd/chore/switch-to-master-v3.2.0 Switch v3.2.0 to master	2021-11-03 16:27:34 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Bram Vanroy	cab9209c3d	use metaclass to decorate errors (#9593 )	2021-11-03 15:29:32 +01:00
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Adriane Boyd	e06bbf72a4	Fix tok2vec-less textcat generation in website quickstart (#9610 )	2021-11-03 15:11:07 +01:00
Adriane Boyd	db0d8c56d0	Add test for Language.pipe as_tuples with custom error handlers (#9608 ) * make nlp.pipe() return None docs when no exceptions are (re-)raised during error handling * Remove changes other than as_tuples test * Only check warning count for one process * Fix types * Format Co-authored-by: Xi Bai <xi.bai.ed@gmail.com>	2021-11-03 10:57:34 +01:00
Adriane Boyd	79cea03983	Update website model display (#9589 ) * Remove vectors from core trf model descriptions * Update accuracy labels and exclude morph_acc for ja	2021-11-03 09:56:00 +01:00
Paul O'Leary McCann	e43639b27a	Add note about round-trip serializing pipeline to API docs (#9583 )	2021-11-03 09:55:30 +01:00
Adriane Boyd	6eee024ff6	Pickle Doc._context (#9603 )	2021-11-03 09:14:29 +01:00
Adriane Boyd	61daac54e4	Serialize _context separately in multiprocessing pipe (#9597 ) * Serialize _context with Doc * Revert "Serialize _context with Doc" This reverts commit `161f1fac91`. * Serialize Doc._context separately for multiprocessing pipe	2021-11-03 07:51:53 +01:00
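
Note on commit 72f7f4e68a: the message describes a `labels` property that rebuilds a tuple on every access, and a fix that builds the tuple once per call instead of once per token. The pattern can be sketched in plain Python as below. The `Labeler` class, its methods, and the `tuple_builds` counter are hypothetical stand-ins for illustration and are not spaCy's actual morphologizer code.

```python
class Labeler:
    """Hypothetical stand-in for a component that stores labels as dict keys."""

    def __init__(self, labels):
        self._labels = {label: i for i, label in enumerate(labels)}
        self.tuple_builds = 0  # counts how often the tuple is rebuilt

    @property
    def labels(self):
        # Every access converts the dictionary keys to a fresh tuple.
        self.tuple_builds += 1
        return tuple(self._labels.keys())

    def annotate_slow(self, label_ids):
        # Reads the property once per token, rebuilding the tuple each time.
        return [self.labels[i] for i in label_ids]

    def annotate_fast(self, label_ids):
        # Builds the tuple once and reuses it for every token.
        labels = self.labels
        return [labels[i] for i in label_ids]
```

Both methods return the same labels; only the number of tuple constructions differs, which is where a speedup of the kind reported in the commit message would come from.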

15246 Commits
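
Note on commit ac05de2c6c: the message describes adding fallback logic to a lookup so that a factory listed under its bare name can still be found when the registry stores it under a language-prefixed name. A minimal sketch of that idea follows. The function `resolve_factory`, the registry contents, and the `"<lang>.<name>"` key format are assumptions made for illustration; this is not the code from the commit.

```python
def resolve_factory(name, lang, registry):
    """Look up a factory by bare name, then by language-prefixed name."""
    if name in registry:
        return registry[name]
    # Assumed convention: language-specific factories are stored as "<lang>.<name>".
    internal_name = f"{lang}.{name}"
    if internal_name in registry:
        return registry[internal_name]
    raise KeyError(f"No factory '{name}' or '{internal_name}' in registry")


# Hypothetical registry contents, for illustration only.
registry = {
    "tagger": "generic tagger factory",
    "fr.my_component": "French-only factory",
}

generic = resolve_factory("tagger", "fr", registry)
specific = resolve_factory("my_component", "fr", registry)
```

Here `generic` resolves on the first lookup, while `specific` is only found through the language-prefixed fallback.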