spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-11 22:52:39 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	869cc4ab0b	warn when an unsupported/unknown key is given to the dependency matcher (#12928 )	2023-08-22 09:03:35 +02:00
Connor Brinton	6dd56868de	📝 Fix formula for receptive field in docs (#12918 ) SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce contextualized embeddings using a `MaxoutWindowEncoder` layer. These convolutions are implemented using Thinc's `expand_window` layer, which concatenates `window_size` neighboring sequence items on either side of the sequence item being processed. This is repeated across `depth` convolutional layers. For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder` layer with a context window of 1 and a depth of 2. We'll focus on the token "C". We can visually represent the contextual embedding produced for "C" as: ```mermaid flowchart LR A0(A<sub>0</sub>) B0(B<sub>0</sub>) C0(C<sub>0</sub>) D0(D<sub>0</sub>) E0(E<sub>0</sub>) B1(B<sub>1</sub>) C1(C<sub>1</sub>) D1(D<sub>1</sub>) C2(C<sub>2</sub>) A0 --> B1 B0 --> B1 C0 --> B1 B0 --> C1 C0 --> C1 D0 --> C1 C0 --> D1 D0 --> D1 E0 --> D1 B1 --> C2 C1 --> C2 D1 --> C2 ``` Described in words, this graph shows that before the first layer of the convolution, the "receptive field" centered at each token consists only of that same token. That is to say, that we have a receptive field of 1. The first layer of the convolution adds one neighboring token on either side to the receptive field. Since this is done on both sides, the receptive field increases by 2, giving the first layer a receptive field of 3. The second layer of the convolutions adds an _additional_ neighboring token on either side to the receptive field, giving a final receptive field of 5. However, this doesn't match the formula currently given in the docs, which read: > The receptive field of the CNN will be > `depth * (window_size * 2 + 1)`, so a 4-layer network with a window > size of `2` will be sensitive to 20 words at a time. Substituting in our depth of 2 and window size of 1, this formula gives us a receptive field of: ``` depth * (window_size * 2 + 1) = 2 * (1 * 2 + 1) = 2 * (2 + 1) = 2 * 3 = 6 ``` This not only doesn't match our computations from above, it's also an even number! This is suspicious, since the receptive field is supposed to be centered on a token, and not between tokens. Generally, this formula results in an even number for any even value of `depth`. The error in this formula is that the adjustment for the center token is multiplied by the depth, when it should occur only once. The corrected formula, `depth * window_size * 2 + 1`, gives the correct value for our small example from above: ``` depth * window_size * 2 + 1 = 2 * 1 * 2 + 1 = 4 + 1 = 5 ``` These changes update the docs to correct the receptive field formula and the example receptive field size.	2023-08-21 10:52:32 +02:00
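The receptive-field argument in commit 6dd56868de can be checked with a small simulation. This is only a sketch: the function names are hypothetical and it does not use spaCy or Thinc. It models each layer as letting a position see `window_size` neighbors on either side and counts which input offsets reach the center token.

```python
def receptive_field_by_simulation(depth: int, window_size: int) -> int:
    """Count the input positions that can influence one output position."""
    # Before any convolution, only the center token (offset 0) is visible.
    offsets = {0}
    for _ in range(depth):
        # Each layer extends every reachable offset by up to window_size
        # positions to the left and to the right.
        offsets = {
            o + d
            for o in offsets
            for d in range(-window_size, window_size + 1)
        }
    return len(offsets)


def old_formula(depth: int, window_size: int) -> int:
    # Formula previously given in the docs, per the commit message.
    return depth * (window_size * 2 + 1)


def corrected_formula(depth: int, window_size: int) -> int:
    # Formula proposed in the commit message.
    return depth * window_size * 2 + 1


if __name__ == "__main__":
    for depth, window_size in [(2, 1), (4, 2)]:
        print(
            depth,
            window_size,
            receptive_field_by_simulation(depth, window_size),
            old_formula(depth, window_size),
            corrected_formula(depth, window_size),
        )
```

For the "ABCDE" example (depth 2, window size 1) the simulation agrees with the corrected formula, and the center-token term is counted once rather than once per layer.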
Adriane Boyd	198488ee86	Extend to weasel v0.3 (#12908 ) * Extend to weasel v0.3 * Clean up unused imports in test_cli	2023-08-16 17:36:53 +02:00
Adriane Boyd	76a9f9c6c6	Docs: clarify abstract spacy.load examples (#12889 )	2023-08-16 17:28:34 +02:00
William Mattingly	64b8ee2dbe	Update universe.json (#12904 ) * Update universe.json added hobbit-spacy to the universe json * Update universe.json removed displacy from hobbit-spacy and added a default text.	2023-08-14 16:44:14 +02:00
denizcodeyaa	d50b8d51e2	Update examples.py (#12895 ) Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa	2023-08-11 15:38:06 +02:00
Adriane Boyd	6a4aa43164	Extend to thinc v8.2 (#12897 )	2023-08-11 13:05:46 +02:00
Adriane Boyd	9622c11529	Extend to weasel v0.2 (#12902 )	2023-08-11 10:59:51 +02:00
Adriane Boyd	6ef29c4115	Merge pull request #12901 from adrianeboyd/feature/spacy-transformers-v1.3-revert Revert "Extend to spacy-transformers v1.3.x (#12877)"	2023-08-10 16:43:10 +02:00
Adriane Boyd	060241a8d5	Revert "Extend to spacy-transformers v1.3.x (#12877 )" This reverts commit `e5773e0c69`.	2023-08-10 11:42:09 +02:00
Adriane Boyd	458bc5f45c	Set version to v3.6.1 (#12892 )	2023-08-08 15:04:13 +02:00
Adriane Boyd	c4e378df97	Update CuPy extras (#12890 ) * Add `cuda12x` for `cupy-cuda12x`. * Drop `cuda-autodetect` from quickstart, set default to `cuda11x` instead.	2023-08-08 12:58:28 +02:00
Adriane Boyd	245e2ddc25	Allow pydantic v2 using transitional v1 support (#12888 )	2023-08-08 11:27:28 +02:00
Adriane Boyd	45af8a5dcf	Update br tags (#12882 ) * Fix displacy br tag * Prefer <br>, also update package CLI	2023-08-04 10:52:41 +02:00
Sofie Van Landeghem	3b7faf4f5e	fix (#12881 )	2023-08-03 08:37:43 +02:00
Arman Mohammadi	07407e07ab	fix the regular expression matching on the full text (#12883 ) There was a mistake in the regex pattern that kept it from matching all the desired tokens. The problem was that when the r string-literal prefix is used to mark a raw string, a single backslash must not be written as two backslashes.	2023-08-02 16:52:26 +02:00
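The raw-string pitfall described in commit 07407e07ab can be illustrated with the standard `re` module. The sample text and patterns below are made up for illustration and are not the ones changed in the commit.

```python
import re

# Hypothetical sample text, chosen only to contain some digits.
text = "spaCy v3.6.1 was released in 2023"

# In a raw string, r"\\d+" is the regex \\d+ : a literal backslash
# followed by one or more "d" characters. The text has no backslash.
doubled = re.findall(r"\\d+", text)

# r"\d+" is the regex \d+ : one or more digits.
single = re.findall(r"\d+", text)

# Without the r prefix, the doubled backslash is what you need,
# because Python itself consumes one level of escaping.
same_pattern = "\\d+" == r"\d+"

if __name__ == "__main__":
    print(doubled)
    print(single)
    print(same_pattern)
```

Doubling the backslash is only correct in a regular (non-raw) string literal; inside a raw string it changes the pattern.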
Adriane Boyd	e5773e0c69	Extend to spacy-transformers v1.3.x (#12877 )	2023-08-02 09:35:16 +02:00
Sofie Van Landeghem	0737443096	feat: add example stubs (3) (#12801 ) * feat: add example stubs * fix: add required annotations * fix: mypy issues * fix: use Py36-compatible Protocol * Minor reformatting * adding further type specifications and removing internal methods * black formatting * widen type to iterable * add private methods that are being used by the built-in convertors * revert changes to corpus.py * fixes * fixes * fix typing of PlainTextCorpus --------- Co-authored-by: Basile Dura <basile@bdura.me> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-08-02 08:15:12 +02:00
Madeesh Kannan	222bd3c5b1	Display model's full base version string in incompatibility warning (#12857 )	2023-08-02 08:06:41 +02:00
Adriane Boyd	0fe43f40f1	Support registered vectors (#12492 ) * Support registered vectors * Format * Auto-fill [nlp] on load from config and from bytes/disk * Only auto-fill [nlp] * Undo all changes to Language.from_disk * Expand BaseVectors These methods are needed in various places for training and vector similarity. * isort * More linting * Only fill [nlp.vectors] * Update spacy/vocab.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Revert changes to test related to auto-filling [nlp] * Add vectors registry * Rephrase error about vocab methods for vectors * Switch to dummy implementation for BaseVectors.to_ops * Add initial draft of docs * Remove example from BaseVectors docs * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/basevectors.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type and lint bpemb example * Update website/docs/api/basevectors.mdx --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-08-01 15:46:08 +02:00
Peter Baumgartner	a0a195688f	Tests for CLI app - `init config` generates `train`-able config (#12173 ) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unnecessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 14:45:04 +02:00
Andy Friedman	186889ec9c	added entry for SaysWho (#12828 ) * Update universe.json added entry for Sayswho * Update universe.json updated sayswho entry * Update universe.json * Update website/meta/universe.json * Update website/meta/universe.json --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-31 10:52:32 +02:00
Sofie Van Landeghem	c9e9dccf79	Add displaCy data structures to docs (2) (#12875 ) * Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2023-07-31 10:47:57 +02:00
Victoria	49055ed7c8	Add cli for finding locations of registered func (#12757 ) * Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 09:39:00 +02:00
Adriane Boyd	9ffa5d8a15	Remove ray extra (#12870 )	2023-07-28 15:48:36 +02:00
Márton Kardos	51b9655470	Added OdyCy to spaCy Universe (#12826 ) * Added OdyCy to spaCy Universe * Replaced template tags Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-26 16:05:53 +02:00
Madeesh Kannan	98799d849e	`SpanCat`: Remove invalid `threshold` config argument (#12860 )	2023-07-26 13:56:31 +02:00
Adriane Boyd	f8f489bcd6	Switch from distutils to setuptools/sysconfig (#12853 ) Additionally remove outdated `is_new_osx` check and settings.	2023-07-24 16:58:27 +02:00
Victoria	e2b89012a2	Add spacy-llm docs to website (#12782 ) * initial commit * update for v0.4.0 * Apply suggestions from code review * Fix formatting * Apply suggestions from code review * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx * update usage page * Apply suggestions from review * Apply suggestions from review * fix links * fix relative links * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from review * Add section on Llama 2. Format. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-07-24 14:44:47 +02:00
Adriane Boyd	1d216a7ea6	Update README for v3.6 (#12844 ) * Update most recent release * Switch from azure to GHA CI tests badge * Remove link to survey * Format	2023-07-24 10:41:04 +02:00
Adriane Boyd	5888afa884	Update numpy build constraints for numpy 1.25 (#12839 ) * Update numpy build constraints for numpy 1.25 Starting in numpy 1.25 (see https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is backwards-compatible by default. For python 3.9+, we should be able to drop the specific numpy build requirements and use `numpy>=1.25`, which is currently backwards-compatible to `numpy>=1.19`. In the future, the python <3.9 requirements could be dropped and the lower numpy pin could correspond to the oldest supported version for the current lower python pin. * Turn off fail-fast * Revert "Turn off fail-fast" This reverts commit `4306f516bc`. * Update for python 3.6 * Fix typo	2023-07-24 10:32:56 +02:00
Jacobo Myerston	4f8daa4f00	Add Left and Right Pointing Angle Brackets as punctuation to ancient Greek (#12829 ) * Update universe.json * Update universe.json add some missing commas in the greCy's description. * Update punctuation.py Add mathematical left and right angle brackets as punctuation for ancient Greek for better tokenization.	2023-07-20 11:16:01 +02:00
Sofie Van Landeghem	ea54d1775a	Merge pull request #12840 from svlandeg/sync_develop Sync develop	2023-07-19 13:12:51 +02:00
svlandeg	79ec68f01b	Merge branch 'upstream_master' into sync_develop	2023-07-19 12:08:52 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Adriane Boyd	1509c96694	Clean up unused code in Language (#12836 ) Follow-up to #12701.	2023-07-18 14:10:30 +02:00
Adriane Boyd	6bf7c65329	Update matcher pattern validation tests (#12835 ) - parametrize over individual token patterns (as originally intended, as far as I can tell) - add a test for lowercase `in` in patterns	2023-07-18 10:00:07 +02:00
Adriane Boyd	95075298f5	Update pex Makefile defaults (#12832 ) * Update pex Makefile defaults - switch to python 3.8 - only install spacy-lookups-data for extra packages * Update website for pex defaults	2023-07-18 09:29:04 +02:00
Ian Thompson	ef20e114e0	Typo fix in `Language.replace_listeners` docs (#12823 ) * modified: spacy/language.py - corrected typo in docstring for :method:`Language.replace_listeners` - added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned modified: website/docs/api/language.mdx - corrected typo in `Language.replace_listeners` markdown * modified: spacy/language.py - removed noqa comment --------- Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>	2023-07-14 09:45:54 +02:00
Connor Brinton	0566c3a166	🐛 Escape annotated HTML tags in span renderer (#12817 ) These changes add a missing call to `escape_html` in the displaCy span renderer. Previously span-annotated tokens would be inserted into the page markup without being escaped, resulting in potentially incorrect rendering. When I encountered this issue, it resulted in some docs and span underlines being superimposed on top of properly rendered docs and span underlines near the beginning of the visualization (due to an unescaped `<span>` tag).	2023-07-13 17:33:05 +02:00
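The effect of the missing escape described in commit 0566c3a166 can be sketched with the standard library. `render_token` below is a hypothetical, heavily simplified stand-in for a span renderer, and `html.escape` is used in place of the `escape_html` helper named in the commit message; it is not displaCy's actual code.

```python
from html import escape


def render_token(text: str, label: str, escape_text: bool = True) -> str:
    """Wrap a token in a <span>, optionally escaping the token text."""
    # Escaping turns "<" and ">" into entities so the token text is shown
    # as text instead of being parsed as markup.
    shown = escape(text) if escape_text else text
    return f'<span class="token" data-label="{escape(label)}">{shown}</span>'


if __name__ == "__main__":
    # A token whose text happens to look like an HTML tag.
    print(render_token("<span>", "TAG", escape_text=False))
    print(render_token("<span>", "TAG"))
```

In the unescaped output the token text opens a second, unclosed `<span>` element, which is the kind of markup that can disturb the layout of everything rendered after it.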
Sofie Van Landeghem	ddffd09602	Trainable lemmatizer docs link (#12795 ) * add an anchor to the trainable lemmatizer section * add requirement for morphologizer,tagger to rule-based lemmatizer * morphologizer only	2023-07-07 15:18:16 +02:00
Adriane Boyd	1a55661cfb	Update website binder version to v3.6 (#12805 )	2023-07-07 10:52:33 +02:00
Adriane Boyd	41dba5bd34	Update max_length default in span finder docs (#12803 )	2023-07-07 10:17:41 +02:00
Sofie Van Landeghem	b1b20bf69d	Replace projects functionality with weasel (#12769 ) * Setting up weasel branch (#12456) * remove project-specific functionality * remove project-specific tests * remove project-specific schemas * remove project-specific information in about * remove project-specific functions in util.py * remove project-specific error strings * remove project-specific CLI commands * black formatting * restore some functions that are used beyond projects * remove project imports * remove imports * remove remote_storage tests * remove one more project unit test * update for PR 12394 * remove get_hash and get_checksum * remove upload_ and download_file methods * remove ensure_pathy * revert clumsy fingers * reinstate E970 * feat: use weasel as spacy project command (#12473) * feat: use weasel as spacy project command * build: use constrained requirement for weasel * feat: add weasel to the library requirements * build: update weasel to new version * build: use specific weasel tag * build: use weasel-0.1.0rc1 from PyPI * fix: remove weasel from requirements.txt * fix: requirements.txt and setup.cfg need to reflect each other * feat: remove legacy spacy project code * bump version * further merge fixes * isort --------- Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-07-07 09:10:27 +02:00
Sofie Van Landeghem	9e63006b12	Merge pull request #12800 from explosion/master_copy Sync develop with master	2023-07-07 08:44:19 +02:00
svlandeg	991bcc111e	disable tests until 3.7 models are available	2023-07-07 08:09:57 +02:00
Madeesh Kannan	d195923164	Set version to `3.7.0.dev0` (#12799 )	2023-07-06 18:29:03 +02:00
svlandeg	d26e4e0849	Revert "feat: add example stubs (#12679 )" This reverts commit `30bb34533a`.	2023-07-06 17:02:38 +02:00
Basile Dura	30bb34533a	feat: add example stubs (#12679 ) * feat: add example stubs * fix: add required annotations * fix: mypy issues * fix: use Py36-compatible Protocol * Minor reformatting --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-06 16:49:43 +02:00
Adriane Boyd	6fc153a266	Merge pull request #12794 from adrianeboyd/chore/v3.6.0-2 Reenable compat+models tests for v3.6.0	2023-07-06 13:22:21 +02:00

16203 Commits