spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-05 01:04:45 +03:00

Author	SHA1	Message	Date
Raphael Mitsch	7d6ae1b960	Fix type aliases.	2024-02-01 14:51:49 +01:00
Raphael Mitsch	78c72d3ab7	Merge branch 'main' into feature/docwise-generator-batching	2024-01-30 21:00:22 +01:00
Daniël de Kok	ce4ea5ffa7	Py_UNICODE is not compatible with 3.12	2024-01-24 13:08:56 +01:00
Daniël de Kok	82ef6783a8	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-24 09:09:01 +01:00
Daniël de Kok	81beaea70e	Merge remote-tracking branch 'upstream/master' into maintenance/v4-merge-master-20240119	2024-01-19 12:34:29 +01:00
maurice	c608baeecc	Fix typo in method name	2024-01-16 21:54:54 +01:00
Daniël de Kok	7351f6bbeb	Update thinc dependency to 9.0.0.dev4	2024-01-16 15:56:09 +01:00
Daniël de Kok	7ebba86402	Add TextCatReduce.v1 (#13181 ) * Add TextCatReduce.v1 This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer. This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers. This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer. * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence * Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy * Add back a test for TextCatCNN.v2 * Replace TextCatCNN in pipe configurations and templates * Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor * Add last reduction (`use_reduce_last`) * Remove non-working TextCatCNN Netlify redirect * Revert layer changes for the quickstart * Revert one more quickstart change * Remove unused import * Fix docstring * Fix setting name in error message --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-21 11:00:06 +01:00
Daniël de Kok	9b36729cbd	Fix Cython lints	2023-12-18 20:02:15 +01:00
Daniël de Kok	42fe4edfd7	Add distillation tests with max cut size And fix endless loop when the max cut size is 0 or 1.	2023-12-08 20:38:01 +01:00
Daniël de Kok	e2591cda36	isort	2023-12-08 20:24:09 +01:00
Daniël de Kok	e5ec45cb7e	Revert "Merge the parser refactor into `v4` (#10940 )" This reverts commit `a183db3cef`.	2023-12-08 20:23:08 +01:00
Daniël de Kok	05803cfe76	Revert "Reimplement distillation with oracle cut size (#12214 )" This reverts commit `e27c60a702`.	2023-12-08 14:38:05 +01:00
Daniël de Kok	da7ad97519	Update `TextCatBOW` to use the fixed `SparseLinear` layer (#13149 ) * Update `TextCatBOW` to use the fixed `SparseLinear` layer A while ago, we fixed the `SparseLinear` layer to use all available parameters: https://github.com/explosion/thinc/pull/754 This change updates `TextCatBOW` to `v3` which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested. While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two. * Replace TexCatBOW `length_exponent` parameter by `length` We now round up the length to the next power of two if it isn't a power of two. * Remove some tests for TextCatBOW.v2 * Fix missing import	2023-11-29 09:11:54 +01:00
Sofie Van Landeghem	699dd8b3b7	Update __all__ fields (#13063 ) * update all for pipeline.init * add all in training.init * add all in kb.init * alphabetically	2023-10-16 10:17:47 +02:00
Adriane Boyd	538304948e	Remove profile=True from currently profiled cython	2023-09-28 17:09:41 +02:00
Adriane Boyd	55614d6799	Add profile=False to currently unprofiled cython	2023-09-28 17:09:41 +02:00
Adriane Boyd	245e2ddc25	Allow pydantic v2 using transitional v1 support (#12888 )	2023-08-08 11:27:28 +02:00
Adriane Boyd	2702db9fef	Recommend lookups tables from URLs or other loaders (#12283 ) * Recommend lookups tables from URLs or other loaders Shift away from the `lookups` extra (which isn't removed, just no longer mentioned) and recommend loading data from the `spacy-lookups-data` repo or other sources rather than the `spacy-lookups-data` package. If the tables can't be loaded from the `lookups` registry in the lemmatizer, show how to specify the tables in `[initialize]` rather than recommending the `spacy-lookups-data` package. * Add tests for some rule-based lemmatizers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-07-31 15:54:35 +02:00
Raphael Mitsch	25bce73461	Format.	2023-07-28 12:56:45 +02:00
Raphael Mitsch	645b525a08	Fix merge error & tests.	2023-07-28 12:55:09 +02:00
Raphael Mitsch	61b2215b0e	Format.	2023-07-27 16:29:14 +02:00
Raphael Mitsch	a2585333a9	Fix merge errors.	2023-07-27 16:27:59 +02:00
Raphael Mitsch	8aa59c4f65	Merge branch 'v4' into feature/docwise-generator-batching # Conflicts: # spacy/kb/kb.pyx # spacy/kb/kb_in_memory.pyx # spacy/ml/models/entity_linker.py # spacy/pipeline/entity_linker.py # spacy/tests/pipeline/test_entity_linker.py # website/docs/api/entitylinker.mdx	2023-07-27 14:28:06 +02:00
svlandeg	96f2e30c4b	cython fixes and cleanup	2023-07-19 17:41:29 +02:00
svlandeg	47a82c6164	merge fixes	2023-07-19 16:38:29 +02:00
svlandeg	0e3b6a87d6	Merge branch 'upstream_master' into sync_v4	2023-07-19 16:37:31 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Adriane Boyd	830dcca367	SpanFinder: set default max_length to 25 (#12791 ) When the default `max_length` is not set and there are longer training documents, it can be difficult to train and evaluate the span finder due to memory limits and the time it takes to evaluate a huge number of predicted spans.	2023-07-06 09:55:34 +02:00
Adriane Boyd	337a360cc7	Use spans_ prefix for default span finder scores (#12753 )	2023-06-27 19:32:17 +02:00
Daniël de Kok	2468742cb8	isort all the things	2023-06-26 11:41:03 +02:00
Daniël de Kok	e2b70df012	Configure isort to use the Black profile, recursively isort the `spacy` module (#12721 ) * Use isort with Black profile * isort all the things * Fix import cycles as a result of import sorting * Add DOCBIN_ALL_ATTRS type definition * Add isort to requirements * Remove isort from build dependencies check * Typo	2023-06-14 17:48:41 +02:00
Daniël de Kok	4990cfefb4	spancat type fixes	2023-06-12 16:43:11 +02:00
Daniël de Kok	50c5e9a2dd	Merge remote-tracking branch 'upstream/master' into sync-v4-master-20230612	2023-06-12 15:57:10 +02:00
kadarakos	c003aac29a	SpanFinder into spaCy from experimental (#12507 ) * span finder integrated into spacy from experimental * black * isort * black * default spankey constant * black * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * rename * rename * max_length and min_length as Optional[int] and strict checking * black * mypy fix for integer type infinity * revert line order * implement all comparison operators for inf int * avoid two for loops over all docs by not precomputing * interleave thresholding with span creation * black * revert to not interleaving (relized its faster) * black * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update dosctring * enforce that the gold and predicted documents have the same text * new error for ensuring reference and predicted texts are the same * remove todo * adjust test * black * handle misaligned tokenization * return correct variable * failing overfit test * only use a single spans_key like in spancat * black * remove debug lines * typo * remove comment * remove near duplicate reduntant method * use the 'spans_key' variable name everywhere * Update spacy/pipeline/span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * flaky test fix suggestion, hand set bias terms * only test suggester and test result exhaustively * make it clear that the span_finder_suggester is more general (not specific to span_finder) * Update spacy/tests/pipeline/test_span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * remove question comment * move preset_spans_suggester test to spancat tests * Add docs and unify default configs for spancat and span finder * Add `allow_overlap=True` to span finder scorer * Fix offset bug in set_annotations * Ignore labels in span finder scorer * Format * Add span_finder to quickstart template * Move settings to self.cfg, store min/max unset as None * Remove debugging * Update docstrings and docs * Update spacy/pipeline/span_finder.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix imports --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-06-07 15:52:28 +02:00
Raphael Mitsch	d1371d1043	Simplify doc loop in predict().	2023-04-24 22:13:05 +02:00
Raphael Mitsch	2c80db9371	Format.	2023-04-24 21:25:48 +02:00
Raphael Mitsch	ee5d7f4a33	Drop valid_ent_idx_per_doc.	2023-04-24 21:15:30 +02:00
Raphael Mitsch	7aa3758af6	Reformat imports in entity_linker.py.	2023-04-24 21:05:37 +02:00
Raphael Mitsch	49747697a2	Merge branch 'v4' into feature/docwise-generator-batching # Conflicts: # spacy/kb/kb.pyx # spacy/ml/models/entity_linker.py # spacy/pipeline/entity_linker.py # website/docs/api/inmemorylookupkb.mdx # website/docs/api/kb.mdx	2023-04-17 16:28:09 +02:00
Adriane Boyd	69e20ce03d	Fix pickle for ngram suggester (#12486 )	2023-03-31 13:43:51 +02:00
kadarakos	372a90885e	Fix spancat-singlelabel score (#12469 ) * debug argmax sort and add span scores * add missing tests for spanscores	2023-03-29 08:38:11 +02:00
Vinit Ravishankar	28de85737f	Tagger label smoothing (#12293 ) * add label smoothing * use True/False instead of floats * add entropy to debug data * formatting * docs * change test to check difference in distributions * Update website/docs/api/tagger.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * bool -> float * update docs * fix seed * black * update tests to use label_smoothing = 0.0 * set default to 0.0, update quickstart * Update spacy/pipeline/tagger.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update morphologizer, tagger test * fix morph docs * add url to docs --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-03-22 12:17:56 +01:00
Raphael Mitsch	3102e2e27a	Entity linking: use `SpanGroup` instead of `Iterable[Span]` for mentions (#12344 ) * Convert Candidate from Cython to Python class. * Format. * Fix .entity_ typo in _add_activations() usage. * Change type for mentions to look up entity candidates for to SpanGroup from Iterable[Span]. * Update docs. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update doc string of BaseCandidate.__init__(). * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate. * Adjust Candidate to support and mandate numerical entity IDs. * Format. * Fix docstring and docs. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename alias -> mention. * Refactor Candidate attribute names. Update docs and tests accordingly. * Refacor Candidate attributes and their usage. * Format. * Fix mypy error. * Update error code in line with v4 convention. * Reverse erroneous changes during merge. * Update return type in EL tests. * Re-add Candidate to setup.py. * Format updated docs. --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-03-20 12:25:18 +01:00
Raphael Mitsch	e5be5d6092	Merge branch 'v4' into feature/docwise-generator-batching # Conflicts: # spacy/kb/kb.pyx # spacy/kb/kb_in_memory.pyx # spacy/ml/models/entity_linker.py # spacy/pipeline/entity_linker.py # spacy/tests/pipeline/test_entity_linker.py # website/docs/api/inmemorylookupkb.mdx # website/docs/api/kb.mdx	2023-03-20 10:50:54 +01:00
Raphael Mitsch	cb79af3a10	Fix merge leftovers.	2023-03-20 10:31:11 +01:00
Raphael Mitsch	73bdeb01e4	Merge branch 'refactor/el-candidates' into feature/docwise-generator-batching # Conflicts: # spacy/kb/candidate.py # spacy/kb/kb.pyx # spacy/kb/kb_in_memory.pyx # spacy/ml/models/entity_linker.py # spacy/pipeline/entity_linker.py # spacy/tests/pipeline/test_entity_linker.py # website/docs/api/inmemorylookupkb.mdx # website/docs/api/kb.mdx	2023-03-20 10:24:17 +01:00
Raphael Mitsch	9340eb8ad2	Introduce hierarchy for EL `Candidate` objects (#12341 ) * Convert Candidate from Cython to Python class. * Format. * Fix .entity_ typo in _add_activations() usage. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update doc string of BaseCandidate.__init__(). * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename Candidate to InMemoryCandidate, BaseCandidate to Candidate. * Adjust Candidate to support and mandate numerical entity IDs. * Format. * Fix docstring and docs. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename alias -> mention. * Refactor Candidate attribute names. Update docs and tests accordingly. * Refacor Candidate attributes and their usage. * Format. * Fix mypy error. * Update error code in line with v4 convention. * Update spacy/kb/candidate.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Updated error code. * Simplify interface for int/str representations. * Update website/docs/api/kb.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename 'alias' to 'mention'. * Port Candidate and InMemoryCandidate to Cython. * Remove redundant entry in setup.py. * Add abstract class check. * Drop storing mention. * Update spacy/kb/candidate.pxd Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix entity_id refactoring problems in docstrings. * Drop unused InMemoryCandidate._entity_hash. * Update docstrings. * Move attributes out of Candidate. * Partially fix alias/mention terminology usage. Convert Candidate to interface. * Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs(). * Update docstrings related to prior_prob. * Update alias/mention usage in doc(strings). * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Mention -> alias renaming. Drop Candidate.mentions(). Drop InMemoryLookupKB.get_alias_candidates() from docs. * Update docstrings. * Fix InMemoryCandidate attribute names. * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/ml/models/entity_linker.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update W401 test. * Update spacy/errors.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/kb/kb.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use Candidate output type for toy generators in the test suite to mimick best practices * fix docs * fix import --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-03-20 00:34:35 +01:00
Raphael Mitsch	96b61d0671	Fix EL failure with sentence-crossing entities (#12398 ) * Add test reproducing EL failure in sentence-crossing entities. * Format. * Draft fix. * Format. * Fix case for len(ent.sents) == 1. * Format. * Format. * Format. * Fix mypy error. * Merge EL sentence crossing tests. * Remove unneeded sentencizer component. * Fix or ignore mypy issues in test. * Simplify ent.sents handling. * Format. Update assert in ent.sents handling. * Small rewrite --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-03-14 22:02:49 +01:00
Raphael Mitsch	4a921766f1	Remove prior_prob from supported properties in Candidate. Introduce KnowledgeBase.supports_prior_probs().	2023-03-13 16:54:38 +01:00
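
Note on commit 7ebba86402 (Add TextCatReduce.v1): the message describes pooling the token vectors from a tok2vec implementation with first, max, and mean reductions (a last reduction is added later in the same commit), then concatenating the enabled reductions before the classification layer. The following is a minimal pure-Python sketch of that pooling step only, not spaCy's implementation. The function `pool_document` is hypothetical, the flag names other than `use_reduce_last` are assumed by analogy with it, and the concatenation order is an assumption.

```python
from typing import List, Sequence

Vector = List[float]


def pool_document(
    tokens: Sequence[Sequence[float]],
    *,
    use_reduce_first: bool = False,  # assumed name, by analogy with use_reduce_last
    use_reduce_last: bool = False,   # setting name mentioned in the commit message
    use_reduce_max: bool = False,    # assumed name
    use_reduce_mean: bool = True,    # assumed name
) -> Vector:
    """Pool per-token vectors into one document vector (hypothetical helper).

    Every enabled reduction yields a vector of the token width; the results
    are concatenated in the order first, last, max, mean (order is assumed).
    """
    if not tokens:
        raise ValueError("cannot pool an empty document")

    pooled: Vector = []
    if use_reduce_first:
        pooled.extend(tokens[0])  # vector of the first token
    if use_reduce_last:
        pooled.extend(tokens[-1])  # vector of the last token
    if use_reduce_max:
        # element-wise maximum over all tokens
        pooled.extend(max(column) for column in zip(*tokens))
    if use_reduce_mean:
        # element-wise mean over all tokens
        pooled.extend(sum(column) / len(tokens) for column in zip(*tokens))
    if not pooled:
        raise ValueError("at least one reduction must be enabled")
    return pooled


# Two tokens with two-dimensional vectors.
doc = [[1.0, 4.0], [3.0, 2.0]]
mean_only = pool_document(doc)
combined = pool_document(doc, use_reduce_first=True, use_reduce_max=True)
```

With only the mean reduction enabled, the sketch corresponds to the mean-only behavior the commit message attributes to TextCatCNN; enabling more reductions widens the pooled vector by one token width per reduction.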


642 Commits