spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 15:41:59 +03:00

Author	SHA1	Message	Date
Marek Šuppa	67ecac633f	fix: Add missing comma to `examples.py` (#10167 ) * This comma has most probably been left out unintentionally, leading to string concatenation between the two consecutive lines. This issue has been found automatically using a regular expression.	2022-01-30 16:43:29 +09:00
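The failure mode described in the commit above is Python's implicit concatenation of adjacent string literals. A minimal sketch follows; the sentences are made up for illustration and are not the actual contents of `examples.py`.

```python
# Adjacent string literals are joined by the parser, so a missing
# comma silently merges two list items into one.
with_comma = [
    "This is the first sentence.",
    "This is the second sentence.",
]
missing_comma = [
    "This is the first sentence."  # <- no comma here
    "This is the second sentence.",
]

print(len(with_comma))
print(len(missing_comma))
print(missing_comma[0])
```

No error or warning is raised for the second list, which is why this kind of slip tends to be found by pattern search rather than by running the code.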
Adriane Boyd	4f441dfa24	Fix infix as prefix in Tokenizer.explain (#10140 ) * Fix infix as prefix in Tokenizer.explain Update `Tokenizer.explain` to align with the `Tokenizer` algorithm: * skip infix matches that are prefixes in the current substring * Update tokenizer pseudocode in docs	2022-01-28 17:00:54 +01:00
Eduard Zorita	30cf9d6a05	Update typing hints (#10109 ) * Improve typing hints for Matcher.__call__ * Add typing hints for DependencyMatcher * Add typing hints to underscore extensions * Update Doc.tensor type (requires numpy 1.21) * Fix typing hints for Language.component decorator * Use generic np.ndarray type in Doc to avoid numpy version update * Fix mypy errors * Fix cyclic import caused by Underscore typing hints * Use Literal type from spacy.compat * Update matcher.pyi import format Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-28 16:59:54 +01:00
Adriane Boyd	09734c56fc	Use simple suggester for spancat initialization (#10143 ) Instead of running the actual suggester, which may require annotation from annotating components that is not necessarily present in the reference docs, use the built-in 1-gram suggester.	2022-01-28 09:34:23 +01:00
github-actions[bot]	6d4db5c3c7	Auto-format code with black (#10106 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2022-01-21 10:01:10 +01:00
pepemedigu	2abd380f2d	Update lex_attrs.py for Spanish with ordinals (#10038 ) * Update lex_attrs.py Add ordinal words * black formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-20 15:44:13 +01:00
Sofie Van Landeghem	4465fe0306	Merge branch 'develop' into feature/master_copy	2022-01-20 13:36:17 +01:00
Duygu Altinok	47a2916801	Intify IOB (#9738 ) * added iob to int * added tests * added iob strings * added error * blacked attrs * Update spacy/tests/lang/test_attrs.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/attrs.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * added iob strings as global * minor refinement with iob * removed iob strings from token * changed to uppercase * cleaned and went back to master version * imported iob from attrs * Update and format errors * Support and test both str and int ENT_IOB key Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:19:38 +01:00
Duygu Altinok	268ddf8a06	Add ENT_IOB key to Matcher (#9649 ) * added new field * added exception for IOB strings * minor refinement to schema * removed field * fixed typo * imported numerical val * changed the code bit * cosmetics * added test for matcher * set ents of mock docs * added invalid pattern * minor update to documentation * blacked matcher * added pattern validation * add IOB vals to schema * changed into test * mypy compat * cleaned left over * added compat import * changed type * added compat import * changed literal a bit * went back to old * made explicit type * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/schemas.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-20 13:18:39 +01:00
Daniël de Kok	6984f55277	Merge pull request #10048 from danieldk/index-arcs-by-head Use constant-time head lookups in StateC::{L,R}	2022-01-20 13:06:14 +01:00
Paul O'Leary McCann	32bd3856b3	Rename FACILITY to FAC in color list (#10067 ) This matches the English models	2022-01-20 12:00:28 +01:00
Adriane Boyd	a55212fca0	Determine labels by factory name in debug data (#10079 ) * Determine labels by factory name in debug data For all components, return labels for all components with the corresponding factory name rather than for only the default name. For `spancat`, return labels as a dict keyed by `spans_key`. * Refactor for typing * Add test * Use assert instead of cast, removed unneeded arg * Mark test as slow	2022-01-20 11:42:52 +01:00
Richard Hudson	e9c6314539	Bugfix for similarity return types (#10051 )	2022-01-20 11:40:46 +01:00
Daniël de Kok	50d2a2c930	Use fewer Vectors internals (#9879 ) * Use Vectors.shape rather than Vectors.data.shape * Use Vectors.size rather than Vectors.data.size * Add Vectors.to_ops to move data between different ops * Add documentation for Vectors.to_ops	2022-01-18 17:14:35 +01:00
Adriane Boyd	4dfd559e55	Fix spaces in Doc.from_docs for empty docs (#10052 ) Fix spaces in `Doc.from_docs(ensure_whitespace=True)` for cases where a doc ending in whitespace is followed by an empty doc.	2022-01-18 17:12:42 +01:00
Paul O'Leary McCann	c28e33637b	Mark flaky spancat test so it doesn't fail the build (#10075 ) * Mark flaky spancat test so it doesn't fail the build * Skip, don't run and ignore	2022-01-18 09:36:28 +01:00
Natalia Rodnova	47ea6704f1	Span richcmp fix (#9956 ) * Corrected Span's __richcmp__ implementation to take end, label and kb_id in consideration * Updated test * Updated test * Removed formatting from a test for readability sake * Use same tuples for all comparisons Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2022-01-17 11:17:49 +01:00
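The "same tuples for all comparisons" idea in the commit above can be sketched in plain Python. `ToySpan` and its `_key` helper are hypothetical stand-ins, and the exact fields and their order in spaCy's `Span.__richcmp__` are an assumption here; the point is only that every comparison operator is derived from one shared key.

```python
from functools import total_ordering


@total_ordering
class ToySpan:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, start, end, label="", kb_id=""):
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # One tuple shared by every comparison, so ==, < and hash agree.
        return (self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())


org = ToySpan(0, 2, label="ORG")
person = ToySpan(0, 2, label="PERSON")
print(org == person)                 # same offsets, different label
print(ToySpan(0, 2) < ToySpan(0, 3)) # end offset takes part in ordering
```

Comparing only on `start` would treat `org` and `person` as equal, which is the kind of behaviour the commit message says was corrected.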
Adriane Boyd	add52935ff	Revert "Bump sudachipy version (#9917 )" (#10071 ) This reverts commit `58bdd8607b`.	2022-01-17 10:38:37 +01:00
Paul O'Leary McCann	58bdd8607b	Bump sudachipy version (#9917 ) * Edited Slovenian stop words list (#9707) * Noun chunks for Italian (#9662) * added it vocab * copied portuguese * added possessive determiner * added conjed Nps * added nmoded Nps * test misc * more examples * fixed typo * fixed parenth * fixed comma * comma fix * added syntax iters * fix some index problems * fixed index * corrected heads for test case * fixed test case * fixed determiner gender * cleaned left over * added example with apostrophe * French NP review (#9667) * adapted from pt * added basic tests * added fr vocab * fixed noun chunks * more examples * typo fix * changed naming * changed the naming * typo fix * Add Japanese kana characters to default exceptions (fix #9693) (#9742) This includes the main kana, or phonetic characters, used in Japanese. There are some supplemental kana blocks in Unicode outside the BMP that could also be included, but because their actual use is rare I omitted them for now, but maybe they should be added. The omitted blocks are: - Kana Supplement - Kana Extended (A and B) - Small Kana Extension * Remove NER words from stop words in Norwegian (#9820) Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations. Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data. See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831 * Bump sudachipy version * Update sudachipy versions * Bump versions Bumping to the most recent dictionary just to keep things current.
Bumping sudachipy to 5.2 because older versions don't support recent dictionaries. Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Richard Hudson <richard@explosion.ai> Co-authored-by: Duygu Altinok <duygu@explosion.ai> Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>	2022-01-17 08:16:22 +01:00
Daniël de Kok	63fa55089d	Use constant-time head lookups in StateC::{L,R} This change switches the type of the left/right-arc collections from vector[ArcC] to unordered_map[int, vector[Arc]], so that the arcs are keyed by the head. This allows us to find all the left/right arcs for a particular head in constant time in StateC::{L,R}. Benchmarks with long docs (N is the number of text repetitions, time in seconds): Before (using #10019): N=400: 3.2, N=800: 5.0, N=1600: 9.5, N=3200: 23.2, N=6400: 66.8, N=12800: 220.0. After (this commit): N=400: 3.1, N=800: 4.3, N=1600: 6.7, N=3200: 12.0, N=6400: 22.0, N=12800: 42.0. Related to #9858 and #10019.	2022-01-13 12:08:46 +01:00
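The data-structure change in the commit above (a flat arc list replaced by a map keyed by head) can be illustrated in Python. This is only a sketch of the idea: the real code is Cython/C++ using `unordered_map`, and `Arc`, `arcs_for_head_linear` and `ArcIndex` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    head: int
    child: int
    label: int


def arcs_for_head_linear(arcs, head):
    """Flat list: every lookup scans all arcs, O(number of arcs)."""
    return [arc for arc in arcs if arc.head == head]


class ArcIndex:
    """Arcs keyed by head: finding a head's arcs is one hash lookup."""

    def __init__(self):
        self._by_head = defaultdict(list)

    def add(self, arc):
        self._by_head[arc.head].append(arc)

    def for_head(self, head):
        # .get avoids creating empty entries for heads without arcs.
        return self._by_head.get(head, [])


arcs = [Arc(5, 1, 0), Arc(7, 2, 1), Arc(5, 3, 0)]
index = ArcIndex()
for arc in arcs:
    index.add(arc)

print(arcs_for_head_linear(arcs, 5))
print(index.for_head(5))
```

Both calls return the same arcs in insertion order; only the cost of finding them differs as the document grows.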
Daniël de Kok	677c1a3507	Speed up the StateC::L feature function (#10019 ) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given head and then pick the n-th most recent from that vector. Since the number of left-arcs strongly correlates with the doc length and the feature is constructed for every transition, this can make transition-parsing quadratic. With this change StateC::L: - Searches left-arcs backwards. - Stops early when the n-th matching transition is found. - Does not construct a vector (reducing memory pressure). This change doesn't avoid the linear search when the transition that is queried does not occur in the left-arcs. Regardless, performance is improved quite a bit with very long docs: Before (N: time): 400: 3.3, 800: 5.4, 1600: 11.6, 3200: 30.7. After (N: time): 400: 3.2, 800: 5.0, 1600: 9.5, 3200: 23.2. We can probably do better with more tailored data structures, but I first wanted to make a low-impact PR. Found while investigating #9858. * StateC::L: simplify loop	2022-01-13 09:29:58 +01:00
Daniël de Kok	28299644fc	Speed up the StateC::L feature function (#10019 ) * Speed up the StateC::L feature function This function gets the n-th most-recent left-arc with a particular head. Before this change, StateC::L would construct a vector of all left-arcs with the given head and then pick the n-th most recent from that vector. Since the number of left-arcs strongly correlates with the doc length and the feature is constructed for every transition, this can make transition-parsing quadratic. With this change StateC::L: - Searches left-arcs backwards. - Stops early when the n-th matching transition is found. - Does not construct a vector (reducing memory pressure). This change doesn't avoid the linear search when the transition that is queried does not occur in the left-arcs. Regardless, performance is improved quite a bit with very long docs: Before (N: time): 400: 3.3, 800: 5.4, 1600: 11.6, 3200: 30.7. After (N: time): 400: 3.2, 800: 5.0, 1600: 9.5, 3200: 23.2. We can probably do better with more tailored data structures, but I first wanted to make a low-impact PR. Found while investigating #9858. * StateC::L: simplify loop	2022-01-13 09:03:55 +01:00
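The backward search with early stop described in the two entries above can be sketched as follows. The function name, the `(head, child)` tuple layout and the `-1` sentinel for "not found" are assumptions made for this illustration, not the actual Cython implementation.

```python
def nth_most_recent_child(arcs, head, n):
    """Return the child of the n-th most recent arc with `head`, or -1.

    `arcs` is a list of (head, child) pairs in the order they were added.
    The list is walked backwards and the search stops at the n-th match,
    so no intermediate list of matching arcs is built.
    """
    if n < 1:
        return -1
    remaining = n
    for arc_head, child in reversed(arcs):
        if arc_head == head:
            remaining -= 1
            if remaining == 0:
                return child
    # Fewer than n matching arcs: the whole list was scanned.
    return -1


left_arcs = [(5, 1), (7, 2), (5, 3), (5, 4)]
print(nth_most_recent_child(left_arcs, 5, 1))
print(nth_most_recent_child(left_arcs, 5, 3))
print(nth_most_recent_child(left_arcs, 9, 1))
```

As the commit message notes, the miss case still costs a full linear scan, which is what the later change to head-keyed arcs addresses.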
jsnfly	176a90edee	Fix textcat loss scaling (#9904 ) (#10002 ) * add failing test for issue 9904 * remove division by batch size and summation before applying the mean Co-authored-by: jonas <jsnfly@gmx.de>	2022-01-13 09:03:23 +01:00
Sofie Van Landeghem	d8a3012539	Merge pull request #10037 from explosion/master Update develop with master	2022-01-12 12:29:23 +01:00
Ryn Daniels	057b8c64c0	Check for assets with size of 0 bytes (#10026 ) * Check for assets with size of 0 bytes * Update spacy/cli/project/assets.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-12 10:34:23 +01:00
Sofie Van Landeghem	067a44a417	Merge pull request #9987 from explosion/master Update develop with commits from master	2022-01-05 11:49:50 +01:00
Lj Miranda	00e7bf5ffd	Add a few docs to the default_config.cfg (#9981 ) * Clarify patience hyperparameter The current value for patience doesn't seem to indicate that it's pointing to the number of steps. It may be useful to specify that explicitly. Ref: https://github.com/explosion/spaCy/discussions/7450 Ref: https://github.com/explosion/spaCy/discussions/7465 * Update docs for max_steps	2022-01-05 09:16:40 +01:00
Duygu Altinok	55cf492218	Feat/debug data warn spread ents (#9960 ) * added check for crossing boundaries * formatted blacked * Rephrasing slightly Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2022-01-04 18:22:10 +01:00
Sofie Van Landeghem	56dcb39fb7	Fix references to config file in the docs & UX (#9961 ) * doc fixes around config file * fix typo * clarify default	2022-01-04 14:31:26 +01:00
Sofie Van Landeghem	029a48e340	fix type of lexeme.rank (#9979 )	2022-01-04 13:15:25 +01:00
Florian Cäsar	86e71e7b19	Fix Scorer.score_cats for missing labels (#9443 ) * Fix Scorer.score_cats for missing labels * Add test case for Scorer.score_cats missing labels * semantic nitpick * black formatting * adjust test to give different results depending on multi_label setting * fix loss function according to whether or not missing values are supported * add note to docs * small fixes * make mypy happy * Update spacy/pipeline/textcat.py Co-authored-by: Florian Cäsar <florian.caesar@pm.me> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-12-29 11:04:39 +01:00
Sofie Van Landeghem	b8106e0f95	Merge pull request #9951 from explosion/master Update develop branch with master	2021-12-29 10:11:43 +01:00
Peter Baumgartner	72abf9e102	MultiHashEmbed vector docs correction (#9918 )	2021-12-27 11:18:08 +01:00
Duygu Altinok	7ec1452f5f	added elided forms (#9878 ) * added elided forms * rearranged a bit * rearranged a bit * added stopword tests * blacked tests file	2021-12-23 13:41:01 +01:00
Andrew Janco	3cfeb518ee	Handle "_" value for token pos in conllu data (#9903 ) * change '_' to '' to allow Token.pos, when no value for token pos in conllu data * Minor code style Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-12-21 15:46:33 +01:00
Adriane Boyd	837d241b68	Make floret murmurhash endian-neutral (#9735 )	2021-12-20 17:11:31 +01:00
Sofie Van Landeghem	7847839003	Merge pull request #9891 from explosion/master Update develop with master	2021-12-17 14:01:27 +01:00
Adriane Boyd	94fbd88521	Use dict.copy().items() instead of list(.items()) (#9868 )	2021-12-16 09:17:33 +01:00
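Both patterns named in the commit above iterate over a snapshot rather than the live dict, which matters when the dict is modified inside the loop. The commit message does not state its motivation, so the sketch below only shows the general behaviour, using made-up data.

```python
# Iterating a live dict while deleting from it raises RuntimeError.
live = {"a": 1, "b": 2, "c": 3}
error_message = ""
try:
    for key in live:
        del live[key]
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# Iterating over a shallow copy is safe: the loop sees the snapshot,
# while deletions apply to the original dict.
registry = {"a": 1, "b": 2, "c": 3}
for key, value in registry.copy().items():
    if value % 2 == 1:
        del registry[key]
print(registry)
```

`list(registry.items())` would give the same protection; the difference between the two is the type of the snapshot (a dict versus a list of pairs).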
antonpibm	ac45ae3779	Update Tokenizer documentation to reflect token_match and url_match signatures (#9859 )	2021-12-15 09:34:33 +01:00
Adriane Boyd	800737b416	Set version to v3.2.1 (#9823 )	2021-12-07 10:51:45 +01:00
Haakon Meland Eriksen	251119455d	Remove NER words from stop words in Norwegian (#9820 ) Default stop words in Norwegian bokmål (nb) in Spacy contain important entities, e.g. France, Germany, Russia, Sweden and USA, police district, important units of time, e.g. months and days of the week, and organisations. Nobody expects their presence among the default stop words. There is a danger of users complying with the general recommendation of filtering out stop words, while being unaware of filtering out important entities from their data. See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831	2021-12-07 09:45:10 +01:00
Adriane Boyd	a0cdc2b007	Use Language.pipe in evaluate (#9800 )	2021-12-06 20:39:15 +01:00
Adriane Boyd	9964243eb2	Make the Tagger neg_prefix configurable (#9802 )	2021-12-06 18:04:44 +01:00
Duygu Altinok	b56b9e7f31	Entity ruler remove pattern (#9685 ) * added ruler code * added error for non-existing pattern * changed error to warning * changed error to warning * added basic tests * fixed place * added test files * went back to error * went back to pattern error * minor change to docs * changed style * changed doc * changed error slightly * added remove to phrasem api * error key already existed * phrase matcher match code to api * blacked tests * moved comments before expr * corrected error no * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/entityruler.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 15:32:49 +01:00
Natalia Rodnova	472740d613	Added sents property to Span for Spans spanning over several sentences (#9699 ) * Added sents property to Span class that returns a generator of sentences the Span belongs to * Added description to Span.sents property * Update test_span to clarify the difference between span.sent and span.sents Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/tests/doc/test_span.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix documentation typos in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update Span.sents doc string in spacy/tokens/span.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parametrized test_span_spans * Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided * Corrected Span documentation copy/paste issue * Put back accidentally deleted lines * Fixed formatting in span.pyx * Moved check for SENT_START annotation after user hooks in Span.sents * add version where the property was introduced Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-12-06 09:58:01 +01:00
Lj Miranda	7d50804644	Migrate regression tests into the main test suite (#9655 ) * Migrate regressions 1-1000 * Move serialize test to correct file * Remove tests that won't work in v3 * Migrate regressions 1000-1500 Removed regression test 1250 because v3 doesn't support the old LEX scheme anymore. * Add missing imports in serializer tests * Migrate tests 1500-2000 * Migrate regressions from 2000-2500 * Migrate regressions from 2501-3000 * Migrate regressions from 3000-3501 * Migrate regressions from 3501-4000 * Migrate regressions from 4001-4500 * Migrate regressions from 4501-5000 * Migrate regressions from 5001-5501 * Migrate regressions from 5501 to 7000 * Migrate regressions from 7001 to 8000 * Migrate remaining regression tests * Fixing missing imports * Update docs with new system [ci skip] * Update CONTRIBUTING.md - Fix formatting - Update wording * Remove lemmatizer tests in el lang * Move a few tests into the general tokenizer * Separate Doc and DocBin tests	2021-12-04 20:34:48 +01:00
Paul O'Leary McCann	b4d526c357	Add Japanese kana characters to default exceptions (fix #9693 ) (#9742 ) This includes the main kana, or phonetic characters, used in Japanese. There are some supplemental kana blocks in Unicode outside the BMP that could also be included, but because their actual use is rare I omitted them for now, but maybe they should be added. The omitted blocks are: - Kana Supplement - Kana Extended (A and B) - Small Kana Extension	2021-11-30 23:36:39 +01:00
Sofie Van Landeghem	58e29776bd	Merge pull request #9777 from explosion/master Update develop with master	2021-11-30 14:01:23 +01:00
Duygu Altinok	29f28d1f3e	French NP review (#9667 ) * adapted from pt * added basic tests * added fr vocab * fixed noun chunks * more examples * typo fix * changed naming * changed the naming * typo fix	2021-11-30 12:19:07 +01:00
Daniël de Kok	72f7f4e68a	morphologizer: avoid recreating label tuple for each token (#9764 ) * morphologizer: avoid recreating label tuple for each token The `labels` property converts the dictionary key set to a tuple. This property was used for every annotated token, recreating the tuple over and over again. Construct the tuple once in the set_annotations function and reuse it. On a Finnish pipeline that I was experimenting with, this results in a speedup of ~15% (~13000 -> ~15000 WPS). * tagger: avoid recreating label tuple for each token	2021-11-30 11:58:59 +01:00
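The optimization in the commit above is a loop-invariant hoist: a property that rebuilds a tuple on every access is read once before the per-token loop. A minimal sketch, where `ToyTagger` and its counter are hypothetical and not spaCy's morphologizer or tagger:

```python
class ToyTagger:
    """Hypothetical component whose `labels` property builds a new tuple."""

    def __init__(self, label_map):
        self._label_map = dict(label_map)
        self.tuple_builds = 0  # counts how often the tuple is rebuilt

    @property
    def labels(self):
        self.tuple_builds += 1
        return tuple(self._label_map.keys())

    def annotate_slow(self, tag_ids):
        # Rebuilds the label tuple once per token.
        return [self.labels[i] for i in tag_ids]

    def annotate_fast(self, tag_ids):
        # Builds the label tuple once and reuses it for every token.
        labels = self.labels
        return [labels[i] for i in tag_ids]


tagger = ToyTagger({"NOUN": 0, "VERB": 1})
tag_ids = [0, 1, 1, 0]

slow_result = tagger.annotate_slow(tag_ids)
slow_builds = tagger.tuple_builds

tagger.tuple_builds = 0
fast_result = tagger.annotate_fast(tag_ids)
fast_builds = tagger.tuple_builds

print(slow_builds, fast_builds)
```

The output labels are identical in both versions; only the number of tuple constructions changes, from one per token to one per call.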
Narayan Acharya	1be8a4dab3	Displacy serve entity linking support without `manual=True` support. (#9748 ) * Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render * Commit to check pre-commit hooks are run. * Update spacy/displacy/__init__.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Changes as per suggestions on the PR. * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/top-level.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * tag option as new from 3.2.1 onwards Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-11-29 17:13:26 +01:00
Paul O'Leary McCann	ac05de2c6c	Fix Language-specific factory handling in package command (#9674 ) * Use internal names for factories If a component factory is registered like `@French.factory(...)` instead of `@Language.factory(...)`, the name in the factories registry will be prefixed with the language code. However in the nlp.config object the factory will be listed without the language code. The `add_pipe` code has fallback logic to handle this, but packaging code and the registry itself don't. This change makes it so that the factory name in nlp.config is the language-specific form. It's not clear if this will break anything else, but it does seem to fix the inconsistency and resolve the specific user issue that brought this to our attention. * Change approach to use fallback in package lookup This adds fallback logic to the package lookup, so it doesn't have to touch the way the config is built. It seems to fix the tests too. * Remove unnecessary line * Add test This also adds an assert that seems to have been forgotten.	2021-11-29 08:31:02 +01:00
Richard Hudson	7b134b8fbd	New tests for a number of alpha languages (#9703 ) * Added Slovak * Added Slovenian tests * Added Estonian tests * Added Croatian tests * Added Latvian tests * Added Icelandic tests * Added Afrikaans tests * Added language-independent tests * Added Kannada tests * Tidied up * Added Albanian tests * Formatted with black * Added failing tests for anomalies * Update spacy/tests/lang/af/test_text.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Estonian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Croatian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Icelandic tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Latvian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovak tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Added context to failing Slovenian tokenizer test Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-28 21:59:23 +01:00
Natalia Rodnova	a4c43e5c57	Allow Matcher to match on ENT_ID and ENT_KB_ID (#9688 ) * Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on * Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before * Update website/docs/api/matcher.md * Format * Remove skipped tests Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-24 10:37:10 +01:00
Duygu Altinok	25bd9f9d48	Noun chunks for Italian (#9662 ) * added it vocab * copied portuguese * added possessive determiner * added conjed Nps * added nmoded Nps * test misc * more examples * fixed typo * fixed parenth * fixed comma * comma fix * added syntax iters * fix some index problems * fixed index * corrected heads for test case * fixed test case * fixed determiner gender * cleaned left over * added example with apostrophe	2021-11-23 16:29:25 +01:00
Duygu Altinok	a7d7e80adb	EntityRuler improve disk load error message (#9658 ) * added error string * added serialization test * added more to if statements * wrote file to tempdir * added tempdir * changed parameter a bit * Update spacy/tests/pipeline/test_entity_ruler.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-23 16:26:05 +01:00
Adriane Boyd	9ac6d4991e	Add doc_cleaner component (#9659 ) * Add doc_cleaner component * Fix types * Fix loop * Rephrase method description	2021-11-23 15:33:33 +01:00
Adriane Boyd	a77f50baa4	Allow Scorer.score_spans to handle pred docs with missing annotation (#9701 ) If the predicted docs are missing annotation according to `has_annotation`, treat the docs as having no predictions rather than raising errors when the annotation is missing. The motivation for this is a combined tokenization+sents scorer for a component where the sents annotation is optional. To provide a single scorer in the component factory, it needs to be possible for the scorer to continue despite missing sents annotation in the case where the component is not annotating sents.	2021-11-23 15:17:19 +01:00
Adriane Boyd	36c7047946	Use reference parse to initialize parser moves (#9722 )	2021-11-23 14:55:55 +01:00
Richard Hudson	a1f25412da	Edited Slovenian stop words list (#9707 )	2021-11-22 09:46:34 +01:00
Adriane Boyd	0e93b315f3	Convert labels to strings for README in package CLI (#9694 )	2021-11-19 08:51:46 +01:00
Adriane Boyd	ea450d652c	Exclude strings from v3.2+ source vector checks (#9697 ) Exclude strings from `Vector.to_bytes()` comparisons for v3.2+ `Vectors` that now include the string store so that the source vector comparison is only comparing the vectors and not the strings.	2021-11-19 08:51:19 +01:00
Paul O'Leary McCann	f3981bd0c8	Clarify how to fill in init_tok2vec after pretraining (#9639 ) * Clarify how to fill in init_tok2vec after pretraining * Ignore init_tok2vec arg in pretraining * Update docs, config setting * Remove obsolete note about not filling init_tok2vec early This seems to have also caught some lines that needed cleanup.	2021-11-18 15:38:30 +01:00
Adriane Boyd	c9baf9d196	Fix spancat for empty docs and zero suggestions (#9654 ) * Fix spancat for empty docs and zero suggestions * Use ops.xp.zeros in test	2021-11-15 12:40:55 +01:00
github-actions[bot]	67d8c8a081	Auto-format code with black (#9664 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-11-12 10:00:03 +01:00
Sofie Van Landeghem	24cdd4c88e	Merge pull request #9638 from polm/fix/optional-pretrain-path Make Jsonl Corpus reader path optional again	2021-11-09 10:45:14 +01:00
Paul O'Leary McCann	8aa2d32ca9	Update jsonlcorpus constructor types	2021-11-09 16:20:19 +09:00
Paul O'Leary McCann	71fb00ed95	Update spacy/training/corpus.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-11-08 10:02:29 +00:00
Sofie Van Landeghem	c97f29c593	Merge pull request #9629 from ljvmiranda921/chore/migrate-regressions Migrate regression and other tests to the new pytest marker	2021-11-08 09:07:38 +01:00
Paul O'Leary McCann	141f12b92e	Make Jsonl Corpus reader optional again	2021-11-07 18:56:23 +09:00
Lj Miranda	909177589d	Remove utility script	2021-11-06 06:35:58 +08:00
Adriane Boyd	0fc3dee772	Merge pull request #9596 from adrianeboyd/tests/reenable-v3.2.0-tests Reenable tests for v3.2.0	2021-11-05 10:54:30 +01:00
github-actions[bot]	5cdb7eb5c2	Auto-format code with black (#9631 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-05 09:58:36 +01:00
Adriane Boyd	e6f91b6f27	Format (#9630 )	2021-11-05 09:56:26 +01:00
Lj Miranda	8e7deaf210	Add missing imports in some regression tests - test_issue7001-8000.py - test_issue8190.py	2021-11-05 11:47:59 +08:00
Lj Miranda	addeb34bc4	Decorate regression tests Even if the issue number is already in the file, I still decorated them just to follow the convention found in test_issue8168.py	2021-11-05 11:47:44 +08:00
Lj Miranda	91dec2c76e	Decorate non-regression tests	2021-11-05 11:47:33 +08:00
Lj Miranda	199943deb4	Add simple script to add pytest marks	2021-11-05 11:47:28 +08:00
Duygu Altinok	f0e8c9fe58	Spanish noun chunks review (#9537 ) * updated syntax iters * formatted the code * added prepositional objects * code clean up * eliminated left attached adp * added es vocab * added basic tests * fixed typo * fixed typo * list to set * fixed doc name * added code for conj * more tests * differentiated adjectives and flat * fixed typo * added compounds * more compounds * tests for compounds * tests for nominal modifiers * fixed typo * fixed typo * formatted file * reformatted tests * fixed typo * fixed punct typo * formatted after changes * added indirect object * added full sentence examples * added longer full sentence examples * fixed sentence length of test * added passive subj * added test case by Damian	2021-11-05 00:46:36 +01:00
Duygu Altinok	6e6650307d	Portuguese noun chunks review (#9559 ) * added tests * added pt vocab * transferred spanish * added syntax iters * fixed parenthesis * added nmod example * added relative pron * fixed rel pron * added rel subclause * corrected typo * added more NP chains * long sentence * fixed typo * fixed typo * fixed typo * corrected heads * added passive subj * added pass subj * added passive obj * refinement to rights * went back to old * fixed test * fixed typo * fixed typo * formatted * Format * Format test cases Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-04 23:55:49 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Bram Vanroy	cab9209c3d	use metaclass to decorate errors (#9593 )	2021-11-03 15:29:32 +01:00
Paul O'Leary McCann	c1cc94a33a	Fix typo about receptive field size (#9564 )	2021-11-03 15:16:55 +01:00
Adriane Boyd	e06bbf72a4	Fix tok2vec-less textcat generation in website quickstart (#9610 )	2021-11-03 15:11:07 +01:00
Adriane Boyd	db0d8c56d0	Add test for Language.pipe as_tuples with custom error handlers (#9608 ) * make nlp.pipe() return None docs when no exceptions are (re-)raised during error handling * Remove changes other than as_tuples test * Only check warning count for one process * Fix types * Format Co-authored-by: Xi Bai <xi.bai.ed@gmail.com>	2021-11-03 10:57:34 +01:00
Adriane Boyd	6eee024ff6	Pickle Doc._context (#9603 )	2021-11-03 09:14:29 +01:00
Adriane Boyd	61daac54e4	Serialize _context separately in multiprocessing pipe (#9597 ) * Serialize _context with Doc * Revert "Serialize _context with Doc" This reverts commit `161f1fac91`. * Serialize Doc._context separately for multiprocessing pipe	2021-11-03 07:51:53 +01:00
Adriane Boyd	5a979137a7	Set as_tuples on Doc during processing (#9592 ) * Set as_tuples on Doc during processing * Fix types * Format	2021-11-02 15:08:22 +01:00
Adriane Boyd	4d5db737e9	Revert "Temporarily skip compat tests (#9594 )" This reverts commit `667572adca`.	2021-11-02 14:24:06 +01:00
Adriane Boyd	667572adca	Temporarily skip compat tests (#9594 )	2021-11-02 14:10:48 +01:00
Lj Miranda	f1bc655a38	Add initial Tagalog (tl) tests (#9582 ) * Add tl_tokenizer to test fixtures * Add tagalog tests	2021-11-02 08:35:49 +01:00
Adriane Boyd	bb26550e22	Fix StaticVectors after floret+mypy merge (#9566 )	2021-10-29 16:25:43 +02:00
Adriane Boyd	322635e371	Set version to v3.2.0 (#9565 )	2021-10-29 15:22:40 +02:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	0c97ed2746	Rename ja morph features to Inflection and Reading (#9520 ) * Rename ja morph features to Inflection and Reading	2021-10-27 13:13:03 +02:00

9012 Commits