Commit Graph

1885 Commits

Daniël de Kok
e2a3952de5
Add spacy.TextCatParametricAttention.v1 (#13201)
* Add spacy.TextCatParametricAttention.v1

This layer is a simplification of the ensemble classifier that
only uses parametric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide a significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.
(See the config sketch at the end of this entry.)

* Fix merge fallout
2024-01-02 10:03:06 +01:00
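
A minimal sketch of plugging the new layer into a `textcat` component. The architecture name comes from the commit above; the `exclusive_classes` flag and the whole tok2vec sub-config are illustrative assumptions rather than values taken from the commit.

```python
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy.TextCatParametricAttention.v1",
        "exclusive_classes": True,
        # Any tok2vec architecture that produces token vectors should work here;
        # HashEmbedCNN is just a small, self-contained choice for the sketch.
        "tok2vec": {
            "@architectures": "spacy.HashEmbedCNN.v2",
            "pretrained_vectors": None,
            "width": 64,
            "depth": 2,
            "embed_size": 2000,
            "window_size": 1,
            "maxout_pieces": 3,
            "subword_features": True,
        },
    }
}
textcat = nlp.add_pipe("textcat", config=config)
```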
Daniël de Kok
7ebba86402
Add TextCatReduce.v1 (#13181)
* Add TextCatReduce.v1

This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.

This model is a generalization of the TextCatCNN model, which only
supports the mean reduction and is a bit of a misnomer, because it can
also be used with transformers. This change also reimplements
TextCatCNN.v2 using the new TextCatReduce.v1 layer. (A sketch of the
model block follows at the end of this entry.)

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence

* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy

* Add back a test for TextCatCNN.v2

* Replace TextCatCNN in pipe configurations and templates

* Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor

* Add last reduction (`use_reduce_last`)

* Remove non-working TextCatCNN Netlify redirect

* Revert layer changes for the quickstart

* Revert one more quickstart change

* Remove unused import

* Fix docstring

* Fix setting name in error message

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-21 11:00:06 +01:00
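
A hedged sketch of the corresponding model block. The reduce flags follow the commit description (first, last, max, mean, concatenated when several are enabled); check the exact option names against the released architecture docs.

```python
# Model block only; plug it into a textcat component, e.g.
# nlp.add_pipe("textcat", config={"model": textcat_reduce_model}).
textcat_reduce_model = {
    "@architectures": "spacy.TextCatReduce.v1",
    "exclusive_classes": True,
    # At least one reduction must be enabled; enabling several concatenates
    # the pooled representations before the classification layer.
    "use_reduce_first": False,
    "use_reduce_last": False,
    "use_reduce_max": False,
    "use_reduce_mean": True,  # mean pooling mirrors the old TextCatCNN behaviour
    "tok2vec": {
        "@architectures": "spacy.HashEmbedCNN.v2",
        "pretrained_vectors": None,
        "width": 64,
        "depth": 2,
        "embed_size": 2000,
        "window_size": 1,
        "maxout_pieces": 3,
        "subword_features": True,
    },
}
```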
Adriane Boyd
e467573550
Docs: update trf_data examples and pipeline design info (#13164) 2023-12-04 15:15:54 +01:00
Daniël de Kok
da7ad97519
Update TextCatBOW to use the fixed SparseLinear layer (#13149)
* Update `TextCatBOW` to use the fixed `SparseLinear` layer

A while ago, we fixed the `SparseLinear` layer to use all available
parameters: https://github.com/explosion/thinc/pull/754

This change updates `TextCatBOW` to `v3` which uses the new
`SparseLinear_v2` layer. This results in a sizeable improvement on a
text categorization task that was tested.

While at it, `spacy.TextCatBOW.v3` also adds the `length_exponent`
option to make it possible to change the hidden size. Ideally, we'd just
have an option called `length`, but the way that `TextCatBOW` uses
hashes results in a non-uniform distribution of parameters when the
length is not a power of two.

* Replace the TextCatBOW `length_exponent` parameter with `length`

We now round the length up to the next power of two if it isn't
a power of two. (See the example model block at the end of this entry.)

* Remove some tests for TextCatBOW.v2

* Fix missing import
2023-11-29 09:11:54 +01:00
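
A sketch of the updated bag-of-words model block. `ngram_size` and `no_output_layer` are carried over from the v2 architecture; the `length` value is purely illustrative and, per the commit, is rounded up to the next power of two if it isn't one already.

```python
textcat_bow_model = {
    "@architectures": "spacy.TextCatBOW.v3",
    "exclusive_classes": True,
    "ngram_size": 1,
    "no_output_layer": False,
    # New in v3: controls the hidden size of the SparseLinear_v2 layer.
    "length": 262144,
}
```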
Adriane Boyd
513bbd5fa3
Add preferred use of build for package CLI (#13109)
Build with `build` if available. Warn and fall back to the previous
`setup.py`-based build if the `build` build fails.
2023-11-08 17:35:24 +01:00
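
A hedged sketch of that fallback pattern, not spaCy's actual implementation: prefer the standards-based `build` frontend and fall back to a legacy `setup.py` build if it fails.

```python
import subprocess
import sys


def build_sdist(pkg_dir: str) -> None:
    try:
        # Preferred path: PEP 517 build via the `build` package.
        subprocess.run([sys.executable, "-m", "build", "--sdist", pkg_dir], check=True)
    except subprocess.CalledProcessError:
        # Warn and fall back to the legacy setup.py-based build.
        print("`build` failed, falling back to `setup.py sdist`", file=sys.stderr)
        subprocess.run([sys.executable, "setup.py", "sdist"], cwd=pkg_dir, check=True)
```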
Sofie Van Landeghem
a804b83a4b
Update llm docs to clarify task-specific factories (#13082)
* fix typo

* add examples to specify custom model for task-specific factory
2023-10-31 22:07:07 +01:00
Sofie Van Landeghem
48248c62b6
Clarify EL example in docs (#13071)
* add comment that pipeline is a custom one

* add link to NEL tutorial

* prettier

* revert prettier reformat

* revert prettier reformat (2)

* fix typo

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-10-31 21:58:29 +01:00
Raphael Mitsch
0c15876502
Fix spancat typo. (#13095) 2023-10-31 13:45:10 +01:00
Raphael Mitsch
9deaac9786
Add note in docs on score_weight config if using a non-default spans_key for SpanCat (#13093)
* Add note on score_weight if using a non-default spans_key for SpanCat.

* Fix formatting.

* Fix formatting.

* Fix typo.

* Use warning infobox.

* Fix infobox formatting.
2023-10-30 17:02:08 +01:00
Raphael Mitsch
d72029d9c8
Add binary examples for Textcat task in spacy-llm (#13051)
* Add examples for binary classification.

* Fix example.

* Remove binary textcat example. Format.

* Rephrase.
2023-10-11 12:23:38 +02:00
Ines Montani
b83f1e3724
Inline displaCy visualizations in docs (#13050) [ci skip] 2023-10-06 14:22:43 +02:00
Raphael Mitsch
be29216fe2
Merge pull request #13044 from explosion/docs/llm_main
Sync `master` with `docs/llm_main`
2023-10-05 16:10:19 +02:00
Raphael Mitsch
1162fcf099
Add Mistral mentions. (#13037) 2023-10-05 14:44:38 +02:00
Raphael Mitsch
862f8254e8
Add docs on Azure OpenAI support in spacy-llm (#13043)
* Add gpt-3.5-turbo-instruct to list of supported OpenAI models.

* Update `spacy-llm` task argument docs w.r.t. task refactoring (#12995)

* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* Update docs w.r.t. PaLM support. (#13018)

* Add info on spacy.Azure.v1.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Attempt to fix netlify check fails.

* Format.
2023-10-05 13:18:27 +02:00
Raphael Mitsch
1dec138e61
Update docs w.r.t. PaLM support. (#13018) 2023-10-05 08:50:41 +02:00
Adriane Boyd
6e54360a3d
Remove pathy dependency, update docs for cloudpathlib in Weasel (#13035) 2023-10-05 08:50:22 +02:00
Raphael Mitsch
734826db79
Update spacy-llm task argument docs w.r.t. task refactoring (#12995)
* Update task arguments w.r.t. task refactoring in 0.5.0.

* Add disclaimer w.r.t. gated models/Llama 2.

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx
2023-10-05 08:45:25 +02:00
Adriane Boyd
160e61772e
Docs for v3.7.0 (#13029)
* Docs for v3.7.0

* Minor fixes

* Extend Weasel notes

* Minor edits

* Update version in README
2023-10-01 21:40:07 +02:00
Adriane Boyd
406794a081 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1 2023-09-28 15:09:06 +02:00
Madeesh Kannan
b4501db6f8
Update emoji library in rule-based matcher example (#13014) 2023-09-25 18:20:30 +02:00
Adriane Boyd
ff4215f1c7
Drop support for python 3.6 (#13009)
* Drop support for python 3.6

* Update docs
2023-09-25 14:48:38 +02:00
Adriane Boyd
935a5455b6
Docs: add new tag for evaluate CLI --spans-keys (#13013) 2023-09-25 11:49:28 +02:00
Eliana Vornov
4e3360ad12
add --spans-key option for CLI spancat evaluation (#12981)
* add span key option for CLI evaluation

* Rephrase CLI help to refer to Doc.spans instead of spancat

* Rephrase docs to refer to Doc.spans instead of spancat

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-25 11:25:41 +02:00
Raphael Mitsch
bef9f63e13 Add gpt-3.5-turbo-instruct to list of supported OpenAI models. 2023-09-21 11:28:58 +02:00
Sofie Van Landeghem
8f0d6b0a8c
Fix in BertTokenizer docs (#12955)
* fix BertWordPieceTokenizer constructor call

* fix

* Update website/docs/usage/linguistic-features.mdx

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-09-13 13:21:58 +02:00
Sofie Van Landeghem
013762be41
Few spacy-llm doc fixes (#12969)
* fix construction example

* shorten task-specific factory list

* small edits to HF models

* small edit to API models

* typo

* fix space

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 11:35:38 +02:00
Sofie Van Landeghem
def7013eec
Docs for spacy-llm 0.5.0 (#12968)
* Update incorrect example config. (#12893)

* spacy-llm docs cleanup (#12945)

* Shorten NER section

* fix template references

* simplify sections

* set temperature to 0.0 in examples

* condense model information

* fix parameters for REST models

* set temperature to 0.0

* spelling fix

* trigger preview

* fix quotes

* add small note on noop.v1

* move up example noop config

* set appropriate model example configs

* explain config

* fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949)

* formatting

* update usage table with NER.v3

* fix typo in links

* v3 overview of parameters

* add spancat.v3

* add further v3 explanations

* remove TODO comment

* few more small fixes

* Add doc section on LLM + task factories (#12905)

* Add section on LLM + task factories.

* Apply suggestions from code review

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* add default config to openai models (#12961)

* Docs for spacy-llm 0.5.0 (#12967)

* simplify Python example

* simplify Python example

* Refer only to latest OpenAI model versions from usage doc

* Typo fix

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

* clarify accuracy claim

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-09-08 10:25:14 +02:00
Magdalena Aniol
cc78847688
fix training.batch_size example (#12963) 2023-09-06 16:38:13 +02:00
Sofie Van Landeghem
6d1f6d9a23
Fix LLM usage example (#12950)
* fix usage example

* revert back to v2 to allow hot fix on main
2023-09-04 09:05:50 +02:00
Sofie Van Landeghem
5c1f9264c2
fix typo in link (#12948)
* fix typo in link

* fix REL.v1 parameter
2023-09-01 13:47:20 +02:00
vincent d warmerdam
3e4264899c
Update large-language-models.mdx (#12944) 2023-08-30 11:58:14 +02:00
Vinit Ravishankar
c2303858e6
Documentation for spacy-curated-transformers (#12677)
* initial

* initial documentation run

* fix typo

* Remove mentions of Torchscript and quantization

Both are disabled in the initial release of `spacy-curated-transformers`.

* Fix `piece_encoder` entries

* Remove `spacy-transformers`-specific warning

* Fix duplicate entries in tables

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove type aliases

* Fix copy-paste typo

* Change `debug pieces` version tag to `3.7`

* Set curated transformers API version to  `3.7`

* Fix transformer listener naming

* Add docs for `init fill-config-transformer`

* Update CLI command invocation syntax

* Update intro section of the pipeline component docs

* Fix source URL

* Add a note to the architectures section about the `init fill-config-transformer` CLI command

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update CLI command name, args

* Remove hyphen from the `curated-transformers.mdx` filename

* Fix links

* Remove placeholder text

* Add text to the model/tokenizer loader sections

* Fill in the `DocTransformerOutput` section

* Formatting fixes

* Add curated transformer page to API docs sidebar

* More formatting fixes

* Remove TODO comment

* Remove outdated info about default config

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add link to HF model hub

* `prettier`

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-29 17:52:16 +02:00
PD Hall
d8a32c1050
docs: fix ngram_range_suggester max_size description (#12939) 2023-08-29 11:10:58 +02:00
Connor Brinton
6dd56868de
📝 Fix formula for receptive field in docs (#12918)
spaCy's `HashEmbedCNN` layer performs convolutions over tokens to produce
contextualized embeddings using a `MaxoutWindowEncoder` layer. These
convolutions are implemented using Thinc's `expand_window` layer, which
concatenates `window_size` neighboring sequence items on either side of
the sequence item being processed. This is repeated across `depth`
convolutional layers.

For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder`
layer with a context window of 1 and a depth of 2. We'll focus on the
token "C". We can visually represent the contextual embedding produced
for "C" as:
```mermaid
flowchart LR
A0(A<sub>0</sub>)
B0(B<sub>0</sub>)
C0(C<sub>0</sub>)
D0(D<sub>0</sub>)
E0(E<sub>0</sub>)
B1(B<sub>1</sub>)
C1(C<sub>1</sub>)
D1(D<sub>1</sub>)
C2(C<sub>2</sub>)
A0 --> B1
B0 --> B1
C0 --> B1
B0 --> C1
C0 --> C1
D0 --> C1
C0 --> D1
D0 --> D1
E0 --> D1
B1 --> C2
C1 --> C2
D1 --> C2
```

Described in words, this graph shows that before the first layer of the
convolution, the "receptive field" centered at each token consists only
of that same token. That is to say, we have a receptive field of 1.
The first layer of the convolution adds one neighboring token on either
side to the receptive field. Since this is done on both sides, the
receptive field increases by 2, giving the first layer a receptive field
of 3. The second layer of the convolutions adds an _additional_
neighboring token on either side to the receptive field, giving a final
receptive field of 5.

However, this doesn't match the formula currently given in the docs,
which read:
> The receptive field of the CNN will be
> `depth * (window_size * 2 + 1)`, so a 4-layer network with a window
> size of `2` will be sensitive to 20 words at a time.

Substituting in our depth of 2 and window size of 1, this formula gives
us a receptive field of:
```
depth * (window_size * 2 + 1)
= 2 * (1 * 2 + 1)
= 2 * (2 + 1)
= 2 * 3
= 6
```

This not only doesn't match our computations from above, it's also an
even number! This is suspicious, since the receptive field is supposed
to be centered on a token, and not between tokens. Generally, this
formula results in an even number for any even value of `depth`.

The error in this formula is that the adjustment for the center token
is multiplied by the depth, when it should occur only once. The
corrected formula, `depth * window_size * 2 + 1`, gives the correct
value for our small example from above:
```
depth * window_size * 2 + 1
= 2 * 1 * 2 + 1
= 4 + 1
= 5
```

These changes update the docs to correct the receptive field formula and
the example receptive field size.
2023-08-21 10:52:32 +02:00
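
A quick numerical check of the corrected formula, using the worked example above (depth 2, window size 1) and the 4-layer, window-size-2 configuration quoted from the old docs.

```python
def receptive_field(depth: int, window_size: int) -> int:
    # Each convolutional layer widens the field by window_size tokens on both
    # sides; the +1 accounts for the centre token exactly once.
    return depth * window_size * 2 + 1


assert receptive_field(depth=2, window_size=1) == 5
assert receptive_field(depth=4, window_size=2) == 17  # not 20, as the old formula implied
```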
Adriane Boyd
76a9f9c6c6
Docs: clarify abstract spacy.load examples (#12889) 2023-08-16 17:28:34 +02:00
Sofie Van Landeghem
3b7faf4f5e
fix (#12881) 2023-08-03 08:37:43 +02:00
Arman Mohammadi
07407e07ab
fix the regular expression matching on the full text (#12883)
There was a mistake in the regex pattern which caused it not to match all the desired tokens. The problem was that when we use the r string literal prefix to denote a raw string, we should not use two backslashes to represent a backslash.
2023-08-02 16:52:26 +02:00
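
A small standalone illustration of the point above (not the pattern from the commit itself): inside an `r""` raw string a single backslash already reaches the regex engine unchanged, so doubling it changes what the pattern matches.

```python
import re

assert re.search(r"\.", "file.txt")             # \. matches the literal dot
assert re.search(r"\\.", "file.txt") is None    # \\. looks for a real backslash
assert re.search(r"\\.", r"dir\name")           # ...which only a backslash satisfies
```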
Adriane Boyd
0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
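
A hypothetical sketch of the registration pattern this commit introduces, based only on the bullets above ("Add vectors registry", "Only fill [nlp.vectors]"). The import path for `BaseVectors` and the factory signature are assumptions to verify against the `BaseVectors` docs, not a definitive API reference.

```python
from spacy.util import registry
from spacy.vectors import BaseVectors  # assumed location of the new base class


class MyVectors(BaseVectors):
    """Placeholder: implement the BaseVectors interface for a custom vector store."""
    ...


@registry.vectors("my_project.MyVectors.v1")
def create_my_vectors():
    # Assumed factory shape: return a callable that builds the vectors for a
    # vocab, so it can be referenced from the [nlp.vectors] block of a config.
    def init_vectors(vocab):
        return MyVectors(vocab)

    return init_vectors
```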
Sofie Van Landeghem
c9e9dccf79
Add displaCy data structures to docs (2) (#12875)
* Add data structures to docs

* Adjusted descriptions for more consistency

* Add _optional_ flag to parameters

* Add tests and adjust optional title key in doc

* Add title to dep visualizations

* fix typo

---------

Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
2023-07-31 10:47:57 +02:00
Victoria
49055ed7c8
Add cli for finding locations of registered func (#12757)
* Add cli for finding locations of registered func

* fixes: naming and typing

* isort

* update naming

* remove to find-function

* remove file:// bit

* use registry name if given and exit gracefully if a registry was not found

* clean up failure msg

* specify registry_name options

* mypy fixes

* return location for internal usage

* add documentation

* more mypy fixes

* clean up example

* add section to menu

* add tests

---------

Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-31 09:39:00 +02:00
Madeesh Kannan
98799d849e
SpanCat: Remove invalid threshold config argument (#12860) 2023-07-26 13:56:31 +02:00
Victoria
e2b89012a2
Add spacy-llm docs to website (#12782)
* initial commit

* update for v0.4.0

* Apply suggestions from code review

* Fix formatting

* Apply suggestions from code review

* Update website/docs/api/large-language-models.mdx

* Update website/docs/api/large-language-models.mdx

* update usage page

* Apply suggestions from review

* Apply suggestions from review

* fix links

* fix relative links

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Apply suggestions from review

* Add section on Llama 2. Format.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-07-24 14:44:47 +02:00
Adriane Boyd
95075298f5
Update pex Makefile defaults (#12832)
* Update pex Makefile defaults

- switch to python 3.8
- only install spacy-lookups-data for extra packages

* Update website for pex defaults
2023-07-18 09:29:04 +02:00
Ian Thompson
ef20e114e0
Typo fix in Language.replace_listeners docs (#12823)
* modified:   spacy/language.py
	- corrected typo in docstring for :method:`Language.replace_listeners`
	- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned

modified:   website/docs/api/language.mdx
	- corrected typo in `Language.replace_listeners` markdown

* modified:   spacy/language.py
	- removed noqa comment

---------

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
2023-07-14 09:45:54 +02:00
Sofie Van Landeghem
ddffd09602
Trainable lemmatizer docs link (#12795)
* add an anchor to the trainable lemmatizer section

* add requirement for morphologizer,tagger to rule-based lemmatizer

* morphologizer only
2023-07-07 15:18:16 +02:00
Adriane Boyd
41dba5bd34
Update max_length default in span finder docs (#12803) 2023-07-07 10:17:41 +02:00
Adriane Boyd
4e19ec7eb8
Docs for v3.6.0 (#12792)
* Docs for v3.6.0

* Add sl performance

* Add da trf note
2023-07-06 12:58:25 +02:00
Daniël de Kok
57a230c6e4
Remove section about parallel training with Ray (#12770)
The Ray integration is currently broken, and having these docs around
suggests that this functionality is currently available.
2023-06-28 17:09:57 +02:00
Adriane Boyd
fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
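
A short sketch of the feature described above. The `attr` keyword is taken directly from the commit bullets ("Vectors.attr"); treat the exact argument name and the accepted attribute strings as assumptions to confirm against the docs.

```python
from spacy.vectors import Vectors

# Key the vector table on the lowercased form instead of the default ORTH.
vectors = Vectors(shape=(1000, 64), attr="LOWER")
```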
kadarakos
c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (realized it's faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update docstring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near-duplicate redundant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
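
A minimal sketch of using the component after it landed. The factory name and the `spans_key` setting follow the commit bullets; other defaults (e.g. `min_length`/`max_length`) may differ in the release.

```python
import spacy

nlp = spacy.blank("en")
# The span finder predicts span boundaries and writes candidate spans to
# doc.spans[spans_key], where a spancat component can then classify them.
span_finder = nlp.add_pipe("span_finder", config={"spans_key": "sc"})
```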