spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-27 10:43:55 +03:00

Author	SHA1	Message	Date
Raphael Mitsch	830eba5426	Merge pull request #12994 from explosion/docs/llm_main Synch `llm_develop` with `llm_main`	2023-09-20 10:05:40 +02:00
Raphael Mitsch	163ec6fba8	Merge pull request #12993 from explosion/master Synch `llm_main` with `master`	2023-09-20 10:04:35 +02:00
Sofie Van Landeghem	8f0d6b0a8c	Fix in BertTokenizer docs (#12955 ) * fix BertWordPieceTokenizer constructor call * fix * Update website/docs/usage/linguistic-features.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-09-13 13:21:58 +02:00
Sofie Van Landeghem	013762be41	Few spacy-llm doc fixes (#12969 ) * fix construction example * shorten task-specific factory list * small edits to HF models * small edit to API models * typo * fix space Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 11:35:38 +02:00
Sofie Van Landeghem	def7013eec	Docs for spacy-llm 0.5.0 (#12968 ) * Update incorrect example config. (#12893) * spacy-llm docs cleanup (#12945) * Shorten NER section * fix template references * simplify sections * set temperature to 0.0 in examples * condense model information * fix parameters for REST models * set temperature to 0.0 * spelling fix * trigger preview * fix quotes * add small note on noop.v1 * move up example noop config * set appropriate model example configs * explain config * fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949) * formatting * update usage table with NER.v3 * fix typo in links * v3 overview of parameters * add spancat.v3 * add further v3 explanations * remove TODO comment * few more small fixes * Add doc section on LLM + task factories (#12905) * Add section on LLM + task factories. * Apply suggestions from code review --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * add default config to openai models (#12961) * Docs for spacy-llm 0.5.0 (#12967) * simplify Python example * simplify Python example * Refer only to latest OpenAI model versions from usage doc * Typo fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * clarify accuracy claim --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 10:25:14 +02:00
Magdalena Aniol	cc78847688	fix training.batch_size example (#12963 )	2023-09-06 16:38:13 +02:00
Sofie Van Landeghem	6d1f6d9a23	Fix LLM usage example (#12950 ) * fix usage example * revert back to v2 to allow hot fix on main	2023-09-04 09:05:50 +02:00
Sofie Van Landeghem	5c1f9264c2	fix typo in link (#12948 ) * fix typo in link * fix REL.v1 parameter	2023-09-01 13:47:20 +02:00
David Berenstein	065ead4eed	updated `add_pipe` docs (#12947 )	2023-09-01 11:05:36 +02:00
vincent d warmerdam	3e4264899c	Update large-language-models.mdx (#12944 )	2023-08-30 11:58:14 +02:00
Ines Montani	52758e1afa	Add headers to netlify.toml [ci skip]	2023-08-30 11:55:23 +02:00
PD Hall	d8a32c1050	docs: fix ngram_range_suggester max_size description (#12939 )	2023-08-29 11:10:58 +02:00
Connor Brinton	6dd56868de	📝 Fix formula for receptive field in docs (#12918 ) SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce contextualized embeddings using a `MaxoutWindowEncoder` layer. These convolutions are implemented using Thinc's `expand_window` layer, which concatenates `window_size` neighboring sequence items on either side of the sequence item being processed. This is repeated across `depth` convolutional layers. For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder` layer with a context window of 1 and a depth of 2. We'll focus on the token "C". We can visually represent the contextual embedding produced for "C" as: ```mermaid flowchart LR A0(A<sub>0</sub>) B0(B<sub>0</sub>) C0(C<sub>0</sub>) D0(D<sub>0</sub>) E0(E<sub>0</sub>) B1(B<sub>1</sub>) C1(C<sub>1</sub>) D1(D<sub>1</sub>) C2(C<sub>2</sub>) A0 --> B1 B0 --> B1 C0 --> B1 B0 --> C1 C0 --> C1 D0 --> C1 C0 --> D1 D0 --> D1 E0 --> D1 B1 --> C2 C1 --> C2 D1 --> C2 ``` Described in words, this graph shows that before the first layer of the convolution, the "receptive field" centered at each token consists only of that same token. That is to say, that we have a receptive field of 1. The first layer of the convolution adds one neighboring token on either side to the receptive field. Since this is done on both sides, the receptive field increases by 2, giving the first layer a receptive field of 3. The second layer of the convolutions adds an _additional_ neighboring token on either side to the receptive field, giving a final receptive field of 5. However, this doesn't match the formula currently given in the docs, which read: > The receptive field of the CNN will be > `depth * (window_size * 2 + 1)`, so a 4-layer network with a window > size of `2` will be sensitive to 20 words at a time. 
Substituting in our depth of 2 and window size of 1, this formula gives us a receptive field of: ``` depth * (window_size * 2 + 1) = 2 * (1 * 2 + 1) = 2 * (2 + 1) = 2 * 3 = 6 ``` This not only doesn't match our computations from above, it's also an even number! This is suspicious, since the receptive field is supposed to be centered on a token, and not between tokens. Generally, this formula results in an even number for any even value of `depth`. The error in this formula is that the adjustment for the center token is multiplied by the depth, when it should occur only once. The corrected formula, `depth * window_size * 2 + 1`, gives the correct value for our small example from above: ``` depth * window_size * 2 + 1 = 2 * 1 * 2 + 1 = 4 + 1 = 5 ``` These changes update the docs to correct the receptive field formula and the example receptive field size.	2023-08-21 10:52:32 +02:00
Adriane Boyd	76a9f9c6c6	Docs: clarify abstract spacy.load examples (#12889 )	2023-08-16 17:28:34 +02:00
William Mattingly	64b8ee2dbe	Update universe.json (#12904 ) * Update universe.json added hobbit-spacy to the universe json * Update universe.json removed displacy from hobbit-spacy and added a default text.	2023-08-14 16:44:14 +02:00
denizcodeyaa	d50b8d51e2	Update examples.py (#12895 ) Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa	2023-08-11 15:38:06 +02:00
Adriane Boyd	458bc5f45c	Set version to v3.6.1 (#12892 )	2023-08-08 15:04:13 +02:00
Adriane Boyd	c4e378df97	Update CuPy extras (#12890 ) * Add `cuda12x` for `cupy-cuda12x`. * Drop `cuda-autodetect` from quickstart, set default to `cuda11x` instead.	2023-08-08 12:58:28 +02:00
Adriane Boyd	245e2ddc25	Allow pydantic v2 using transitional v1 support (#12888 )	2023-08-08 11:27:28 +02:00
Adriane Boyd	45af8a5dcf	Update br tags (#12882 ) * Fix displacy br tag * Prefer <br>, also update package CLI	2023-08-04 10:52:41 +02:00
Sofie Van Landeghem	3b7faf4f5e	fix (#12881 )	2023-08-03 08:37:43 +02:00
Arman Mohammadi	07407e07ab	fix the regular expression matching on the full text (#12883 ) There was a mistake in the regex pattern which caused not matching all the desired tokens. The problem was that when we use r string literal prefix to suppose a raw text, we should not use two backslashes to demonstrate a backslash.	2023-08-02 16:52:26 +02:00
Madeesh Kannan	222bd3c5b1	Display model's full base version string in incompatibility warning (#12857 )	2023-08-02 08:06:41 +02:00
Peter Baumgartner	a0a195688f	Tests for CLI app - `init config` generates `train`-able config (#12173 ) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unnecessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 14:45:04 +02:00
Andy Friedman	186889ec9c	added entry for SaysWho (#12828 ) * Update universe.json added entry for Sayswho * Update universe.json updated sayswho entry * Update universe.json * Update website/meta/universe.json * Update website/meta/universe.json --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-31 10:52:32 +02:00
Sofie Van Landeghem	c9e9dccf79	Add displaCy data structures to docs (2) (#12875 ) * Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2023-07-31 10:47:57 +02:00
Victoria	49055ed7c8	Add cli for finding locations of registered func (#12757 ) * Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 09:39:00 +02:00
Márton Kardos	51b9655470	Added OdyCy to spaCy Universe (#12826 ) * Added OdyCy to spaCy Universe * Replaced template tags Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-26 16:05:53 +02:00
Madeesh Kannan	98799d849e	`SpanCat`: Remove invalid `threshold` config argument (#12860 )	2023-07-26 13:56:31 +02:00
Adriane Boyd	f8f489bcd6	Switch from distutils to setuptools/sysconfig (#12853 ) Additionally remove outdated `is_new_osx` check and settings.	2023-07-24 16:58:27 +02:00
Victoria	e2b89012a2	Add spacy-llm docs to website (#12782 ) * initial commit * update for v0.4.0 * Apply suggestions from code review * Fix formatting * Apply suggestions from code review * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx * update usage page * Apply suggestions from review * Apply suggestions from review * fix links * fix relative links * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from review * Add section on Llama 2. Format. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-07-24 14:44:47 +02:00
Adriane Boyd	1d216a7ea6	Update README for v3.6 (#12844 ) * Update most recent release * Switch from azure to GHA CI tests badge * Remove link to survey * Format	2023-07-24 10:41:04 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Adriane Boyd	1509c96694	Clean up unused code in Language (#12836 ) Follow-up to #12701.	2023-07-18 14:10:30 +02:00
Adriane Boyd	6bf7c65329	Update matcher pattern validation tests (#12835 ) - parametrize over individual token patterns (as originally intended, as far as I can tell) - add a test for lowercase `in` in patterns	2023-07-18 10:00:07 +02:00
Adriane Boyd	95075298f5	Update pex Makefile defaults (#12832 ) * Update pex Makefile defaults - switch to python 3.8 - only install spacy-lookups-data for extra packages * Update website for pex defaults	2023-07-18 09:29:04 +02:00
Ian Thompson	ef20e114e0	Typo fix in `Language.replace_listeners` docs (#12823 ) * modified: spacy/language.py - corrected typo in docstring for :method:`Language.replace_listeners` - added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned modified: website/docs/api/language.mdx - corrected typo in `Language.replace_listeners` markdown * modified: spacy/language.py - removed noqa comment --------- Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>	2023-07-14 09:45:54 +02:00
Connor Brinton	0566c3a166	🐛 Escape annotated HTML tags in span renderer (#12817 ) These changes add a missing call to `escape_html` in the displaCy span renderer. Previously span-annotated tokens would be inserted into the page markup without being escaped, resulting in potentially incorrect rendering. When I encountered this issue, it resulted in some docs and span underlines being superimposed on top of properly rendered docs and span underlines near the beginning of the visualization (due to an unescaped `<span>` tag).	2023-07-13 17:33:05 +02:00
Sofie Van Landeghem	ddffd09602	Trainable lemmatizer docs link (#12795 ) * add an anchor to the trainable lemmatizer section * add requirement for morphologizer,tagger to rule-based lemmatizer * morphologizer only	2023-07-07 15:18:16 +02:00
Adriane Boyd	1a55661cfb	Update website binder version to v3.6 (#12805 )	2023-07-07 10:52:33 +02:00
Adriane Boyd	41dba5bd34	Update max_length default in span finder docs (#12803 )	2023-07-07 10:17:41 +02:00
svlandeg	d26e4e0849	Revert "feat: add example stubs (#12679 )" This reverts commit `30bb34533a`.	2023-07-06 17:02:38 +02:00
Basile Dura	30bb34533a	feat: add example stubs (#12679 ) * feat: add example stubs * fix: add required annotations * fix: mypy issues * fix: use Py36-compatible Protocol * Minor reformatting --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-06 16:49:43 +02:00
Adriane Boyd	6fc153a266	Merge pull request #12794 from adrianeboyd/chore/v3.6.0-2 Reenable compat+models tests for v3.6.0	2023-07-06 13:22:21 +02:00
Adriane Boyd	4e19ec7eb8	Docs for v3.6.0 (#12792 ) * Docs for v3.6.0 * Add sl performance * Add da trf note	2023-07-06 12:58:25 +02:00
Adriane Boyd	76329e1dde	Revert "Temporarily skip download CLI related tests in CI" This reverts commit `46ce66021a`.	2023-07-06 12:48:06 +02:00
Adriane Boyd	a1191146f5	Revert "Temporarily skip tests for compat table" This reverts commit `dd5e00c735`.	2023-07-06 12:47:50 +02:00
Adriane Boyd	830dcca367	SpanFinder: set default max_length to 25 (#12791 ) When the default `max_length` is not set and there are longer training documents, it can be difficult to train and evaluate the span finder due to memory limits and the time it takes to evaluate a huge number of predicted spans.	2023-07-06 09:55:34 +02:00
Tom Aarsen	eab929361d	Use 'exclude' instead of 'disable' (#12783 ) as suggested by @svlandeg	2023-07-04 11:45:13 +02:00
Marcus Blättermann	bd239511a4	Fix problem with missing syntax highlighting languages causing runtime crash on the website (#12781 ) * Fix problem with universe pages using `docker` language * Fix problem with universe pages using `r` language * Add fallback, in case code language is unknown	2023-07-03 10:24:25 +02:00
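
The receptive field argument in commit 6dd56868de (#12918) can be checked numerically. The sketch below is illustrative only and does not use spaCy or Thinc: the function names are hypothetical, and the simulation assumes each layer simply adds `window_size` neighbouring positions on either side of every position already in the field, as the commit message describes.

```python
def receptive_field_old(depth: int, window_size: int) -> int:
    """Formula previously given in the docs (counts the center token once per layer)."""
    return depth * (window_size * 2 + 1)


def receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula from the commit message (center token counted once)."""
    return depth * window_size * 2 + 1


def simulate_receptive_field(depth: int, window_size: int) -> int:
    """Count the input positions that can influence one output position.

    Start from a single center position and, for each layer, add every
    position within `window_size` of a position already in the field.
    """
    positions = {0}
    for _ in range(depth):
        positions = {
            p + offset
            for p in positions
            for offset in range(-window_size, window_size + 1)
        }
    return len(positions)


if __name__ == "__main__":
    for depth, window_size in [(2, 1), (4, 2)]:
        print(
            depth,
            window_size,
            receptive_field_old(depth, window_size),
            receptive_field(depth, window_size),
            simulate_receptive_field(depth, window_size),
        )
```

For the "ABCDE" example (depth 2, window size 1) the simulation and the corrected formula both give 5, while the old formula gives 6. For the 4-layer, window size 2 case quoted from the docs, the corrected value is 17 rather than 20.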

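The raw-string issue described in commit 07407e07ab (#12883) can be reproduced with Python's standard `re` module. This is a minimal sketch with made-up sample strings, not the pattern from the spaCy docs: inside a raw string literal a single backslash is already passed to the regex engine as-is, so doubling it changes the pattern to "a literal backslash followed by the letter d".

```python
import re

sample = "room 42, floor 7"

# Raw string with a single backslash: the regex engine sees \d+ (one or more digits).
digits = re.findall(r"\d+", sample)

# Raw string with a doubled backslash: the regex engine sees \\d+,
# i.e. a literal backslash followed by one or more letters "d".
doubled = re.findall(r"\\d+", sample)

# The doubled form only matches text that really contains a backslash.
literal = re.findall(r"\\d+", r"path\ddd")

if __name__ == "__main__":
    print(digits)
    print(doubled)
    print(literal)
```

With the doubled backslash, the digit tokens in `sample` are not matched at all, which is the kind of missed match the commit message reports.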

15995 Commits