spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-12 15:12:39 +03:00

Author	SHA1	Message	Date
Adriane Boyd	0f9d2b01fb	Set version v3.6.0.dev1 (#12703 )	2023-06-07 16:23:14 +02:00
kadarakos	c003aac29a	SpanFinder into spaCy from experimental (#12507 ) * span finder integrated into spacy from experimental * black * isort * black * default spankey constant * black * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * rename * rename * max_length and min_length as Optional[int] and strict checking * black * mypy fix for integer type infinity * revert line order * implement all comparison operators for inf int * avoid two for loops over all docs by not precomputing * interleave thresholding with span creation * black * revert to not interleaving (relized its faster) * black * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update dosctring * enforce that the gold and predicted documents have the same text * new error for ensuring reference and predicted texts are the same * remove todo * adjust test * black * handle misaligned tokenization * return correct variable * failing overfit test * only use a single spans_key like in spancat * black * remove debug lines * typo * remove comment * remove near duplicate reduntant method * use the 'spans_key' variable name everywhere * Update spacy/pipeline/span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * flaky test fix suggestion, hand set bias terms * only test suggester and test result exhaustively * make it clear that the span_finder_suggester is more general (not specific to span_finder) * Update spacy/tests/pipeline/test_span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * remove question comment * move preset_spans_suggester test to spancat tests * Add docs and unify default configs for spancat and span finder * Add `allow_overlap=True` to span finder scorer * Fix offset bug in set_annotations * Ignore labels in span finder scorer * Format * Add span_finder to quickstart template * Move settings to self.cfg, store min/max unset as None * Remove debugging * Update docstrings and docs * Update spacy/pipeline/span_finder.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix imports --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-06-07 15:52:28 +02:00
Basile Dura	c3c064ace4	fix: `InitializableComponent` type hints (#12692 ) * fix: InitializableComponent type hints * fix: avoid circular dependency * style: clean imports in language.py * style: use relative imports Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * fix: apply black --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-06-02 14:29:52 +02:00
Adriane Boyd	c4112a1da3	Require that all SpanGroup spans are from the current doc (#12569 ) * Require that all SpanGroup spans are from the current doc The restriction on only adding spans from the current doc were already implemented for all operations except for `SpanGroup.__init__`. Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in order to validate the character offsets and to make it possible to copy spans between documents with differing tokenization. Currently there is no validation that the document texts are identical, but the span char offsets must be valid spans in the target doc, which prevents you from ending up with completely invalid spans. * Undo change in test_beam_overfitting_IO	2023-06-01 19:19:17 +02:00
Isabel Zimmerman	05df59fd4a	[DOCS] add vetiver to spacy universe (#12557 ) * add vetiver to spacy universe * remove image * update logo to render correctly in thumbnail * apply Basil's suggestion Co-authored-by: Basile Dura <bdura@users.noreply.github.com> * refer to the same model --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-06-01 17:11:18 +02:00
Adriane Boyd	c936db2faf	Address numpy 1.25 deprecations in test suite (#12684 ) * Address upcoming numpy v1.25 deprecations in test suite * Temporarily test most recent numpy prerelease in CI * Revert "Temporarily test most recent numpy prerelease in CI" This reverts commit `d75a66e55e`.	2023-05-31 17:23:07 +02:00
Adriane Boyd	9b7a59c325	Revert "CI: Disable fail-fast (#12658 )" (#12676 ) This reverts commit `1f088cbf4a`.	2023-05-26 10:57:02 +02:00
Vinit Ravishankar	f0e0206b77	update universe for spacypdfreader (#12661 )	2023-05-23 13:28:48 +02:00
Adriane Boyd	1f088cbf4a	CI: Disable fail-fast (#12658 ) While the typing_extensions/pydantic `Literal` bugs are being sorted out, disable fail-fast so the rest of the CI is available for development purposes.	2023-05-23 10:48:06 +02:00
Basile Dura	6ea4155487	feat: add comparison operators in `span.pyi` (#12652 ) * feat: add comparison operators in span.pyi remove Cython-specific `__richcmp__` * fix: comparison operators should be defined for any other object	2023-05-23 08:50:37 +02:00
Victoria	6930a6bf45	Add spaCy VSCode extension materials (#12592 )	2023-05-19 14:38:53 +02:00
Basile Dura	95fd46b1dd	feat: add type hinting on SpanGroup.__iter__ (#12642 )	2023-05-17 14:20:00 +02:00
Adriane Boyd	df083f91a5	Add Malay to website languages (#12643 )	2023-05-17 13:13:43 +02:00
Sani	873c16a4df	Malay language support (#12602 ) * add malay lang * fix token len * black format * reformat conftest malay * remove exceptions not exist in dbp * format code	2023-05-17 12:45:21 +02:00
Lj Miranda	58779c24ef	Remove shorthand for output-file in spacy apply (#12636 ) The output-file argument is positional, so can't use a shorthand like -o.	2023-05-17 12:36:29 +02:00
David Berenstein	83b6f488cb	universe: Update examples Adept Augementation (#12620 ) * Update universe.json * chore: changed readme example as suggested by Vincent Warmerdam (koaning)	2023-05-15 14:09:33 +02:00
Adriane Boyd	3dc445df8d	Fix new tags in docs for v3.5.x (#12629 ) * Fix new tags in docs for v3.5.x * Fix new tag	2023-05-15 12:06:58 +02:00
Basile Dura	2dd8825f09	docs: add comment on `offset_x` argument (#12630 )	2023-05-15 11:42:47 +02:00
Basile Dura	f96b9e03df	build: bump typer version to accept >=0.3<0.10 (#12631 )	2023-05-15 08:06:58 +02:00
Adriane Boyd	3637148c4d	Add scorer option to return per-component scores (#12540 ) * Add scorer option to return per-component scores Add `per_component` option to `Language.evaluate` and `Scorer.score` to return scores keyed by `tokenizer` (hard-coded) or by component name. Add option to `evaluate` CLI to score by component. Per-component scores can only be saved to JSON. * Update help text and messages	2023-05-12 15:36:54 +02:00
Kenneth Enevoldsen	88680a6eed	docs: remove invalid huggingface-hub push argument (#12624 )	2023-05-12 09:40:28 +02:00
Adriane Boyd	b5af0fe836	Revert "Use Latin normalization for Serbian attrs (#12608 )" (#12621 ) This reverts commit `6f314f99c4`. We are reverting this until we can support this normalization more consistently across vectors, training corpora, and lemmatizer data.	2023-05-11 11:54:16 +02:00
royashcenazi	3252f6b13f	Parsigs universe 3 (#12617 ) * parsigs universe * added model installation explanation in the description * Update website/meta/universe.json Co-authored-by: Basile Dura <bdura@users.noreply.github.com> * added model installement instruction in the code example * added biomedical category --------- Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-05-10 13:49:51 +02:00
royashcenazi	a56ab98e3c	parsigs universe (#12616 ) * parsigs universe * added model installation explanation in the description * Update website/meta/universe.json Co-authored-by: Basile Dura <bdura@users.noreply.github.com> * added model installement instruction in the code example --------- Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-05-10 13:19:28 +02:00
David Berenstein	d11b549195	chore: added adept-augmentations to the spacy universe (#12609 ) * chore: added adept-augmentations to the spacy universe * Apply suggestions from code review Co-authored-by: Basile Dura <bdura@users.noreply.github.com> * Update universe.json --------- Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-05-10 13:16:16 +02:00
Patrick J. Burns	15f16db6ca	Fix typo (#12615 )	2023-05-09 15:52:34 +02:00
Patrick J. Burns	eb3960a15a	Add LatinCy models to universe.json (#12597 ) * Add LatinCy models to universe.json * Update website/meta/universe.json Add install code for LatinCy models to 'code_example' Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update LatinCy ‘code_example’ in website/meta/universe.json Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-05-09 12:02:45 +02:00
Adriane Boyd	1279b464bb	In initialize only calculate current vectors hash if needed (#12607 )	2023-05-08 16:51:58 +02:00
Adriane Boyd	6f314f99c4	Use Latin normalization for Serbian attrs (#12608 ) * Use Latin normalization for Serbian attrs Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`. * Update NORMs in tokenizer exceptions and related tests * Add tests for all custom lex attrs * Remove unused imports	2023-05-08 12:33:56 +02:00
Adriane Boyd	cbc6bcf434	Merge pull request #12604 from adrianeboyd/chore/v3.6.0.dev0 Set version to v3.6.0.dev0	2023-05-08 10:05:15 +02:00
Adriane Boyd	46ce66021a	Temporarily skip download CLI related tests in CI	2023-05-08 09:17:33 +02:00
Adriane Boyd	fbd12eb4a4	Set version to v3.6.0.dev0	2023-05-08 09:10:35 +02:00
Adriane Boyd	dbc71ecd44	Remove #egg from download URLs (#12567 ) The current URLs will become invalid in pip 25.0. According to the pip docs, the egg= URLs are currently only needed for editable VCS installs.	2023-05-04 17:13:12 +02:00
Kenneth Enevoldsen	73698326df	Update inmemorylookupkb.mdx (#12586 ) Example does not refer to the in memory lookup	2023-05-02 12:51:13 +02:00
Lj Miranda	298e6036b7	Add spans in spacy benchmark (#12575 ) * Add spans in spacy benchmark The current implementation of spaCy benchmark accuracy / spacy evaluate doesn't include the "spans" type, so calling the command doesn't render the HTML displaCy file needed. This PR attempts to fix that by creating a new parameter for "spans" and calling the appropriate displaCy value. * Reformat file with black * Add tests for evaluate * Fix spans -> span for displacy style * Update test to check render instead * Update source so mypy passes * Add parser information to avoid warnings	2023-04-28 14:32:52 +02:00
Adriane Boyd	6817e3d372	CI: Only run test suite once with thinc-apple-ops for macos python 3.11 (#12436 ) * CI: Only run test suite once with thinc-apple-ops for macos python 3.11 * Adjust syntax * Try alternate syntax * Try alternate syntax * Try alternate syntax	2023-04-28 14:29:51 +02:00
kadarakos	34d1164b0e	Spancat speed improvement (#12577 ) * avoid nesting then flattening * mypy fix * Apply suggestions from code review * Add type for indices * Run full matrix for mypy * Add back modified type: ignore * Revert "Run full matrix for mypy" This reverts commit `e218873d04`. --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-04-27 15:27:13 +02:00
Victoria	a8dfc66135	Add spacy-wasm to universe (#12572 ) * add spacy-wasm to universe * add tag	2023-04-26 14:18:40 +02:00
moxley01	070fa16545	add spacysee project (#12568 )	2023-04-25 12:30:19 +02:00
Adriane Boyd	68da580a4c	CI: Disable Azure (#12560 )	2023-04-21 15:05:53 +02:00
Daniël de Kok	8a5814bf2c	Add distillation loop (#12542 ) * Add distillation initialization and loop * Fix up configuration keys * Add docstring * Type annotations * init_nlp_distill -> init_nlp_student * Do not resolve dot name distill corpus in initialization (Since we don't use it.) * student: do not request use of optimizer in student pipe We apply finish up the updates once in the training loop instead. Also add the necessary logic to `Language.distill` to mirror `Language.update`. * Correctly determine sort key in subdivide_batch * Fix _distill_loop docstring wrt. stopping condition * _distill_loop: fix distill_data docstring Make similar changes in train_while_improving, since it also had incorrect types and missing type annotations. * Move `set_{gpu_allocator,seed}_from_config` to spacy.util * Update Language.update docs for the sgd argument * Type annotation Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2023-04-21 13:49:40 +02:00
Victoria	e115408514	remove survey link (#12559 )	2023-04-21 10:22:26 +02:00
Patrick J. Burns	ab4ba04c32	Update LatinDefaults for lang 'la' (#12538 ) * Add noun chunking to la syntax iterators * Expand list of numeral, ordinal words * Expand abbreviations in la tokenizer_exceptions * Add example sents * Update spacy/lang/la/syntax_iterators.py Reorganize la syntax iterators Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Minor updates based on review * fix call --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-04-20 16:55:40 +02:00
Adriane Boyd	b60b027927	Add default option to MorphAnalysis.get (#12545 ) * Add default to MorphAnalysis.get Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for the user to provide a default return value if the field is not found. The default return value remains `[]`, which is not the same as `dict.get`, but is already established as this method's default return value with the return type `List[str]`. However the new `default` option does not enforce that the user-provided default is actually `List[str]`. * Restore test case	2023-04-20 14:06:32 +02:00
Adriane Boyd	dc0a1a9808	Load exceptions last in Tokenizer.from_bytes (#12553 ) In `Tokenizer.from_bytes`, the exceptions should be loaded last so that they are only processed once as part of loading the model. The exceptions are tokenized as phrase matcher patterns in the background and the internal tokenization needs to be synced with all the remaining tokenizer settings. If the exceptions are not loaded last, there are speed regressions for `Tokenizer.from_bytes/disk` vs. `Tokenizer.add_special_case` as the caches are reloaded more than necessary during deserialization.	2023-04-20 11:30:34 +02:00
Sofie Van Landeghem	8e6a3d58d8	fix typo (#12543 )	2023-04-19 10:59:33 +02:00
TAN Long	923d24e885	perf(REL_OP): Replace some token.children with token.rights or token.lefts (#12528 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-04-17 13:16:34 +02:00
TAN Long	119f959218	docs(REL_OP): modify docs for REL_OPs to match Semgrex's update on CoreNLP v4.5.2 (#12531 ) Co-authored-by: Tan Long <tanloong@foxmail.com>	2023-04-17 13:14:01 +02:00
andyjessen	02259fa195	Add category to spaCy project (#12506 ) ScispaCy fits within biomedical domain. Consider adding this category.	2023-04-07 15:31:04 +02:00
Adriane Boyd	5d0f48fe69	Enforce that Span.start/end(_char) remain valid and in sync (#12268 ) * Enforce that Span.start/end(_char) remain valid and in sync Allowing span attributes to be writable starting in v3 has made it possible for the internal `Span.start/end/start_char/end_char` to get out-of-sync or have invalid values. This checks that the values are valid and syncs the token and char offsets if any attributes are modified directly. It does not yet handle the case where the underlying doc is modified. * Format	2023-04-06 16:01:59 +02:00

16081 Commits