spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-11-13 14:25:52 +03:00

Author	SHA1	Message	Date
Raphael Mitsch	830eba5426	Merge pull request #12994 from explosion/docs/llm_main Synch `llm_develop` with `llm_main`	2023-09-20 10:05:40 +02:00
Raphael Mitsch	163ec6fba8	Merge pull request #12993 from explosion/master Synch `llm_main` with `master`	2023-09-20 10:04:35 +02:00
Sofie Van Landeghem	8f0d6b0a8c	Fix in BertTokenizer docs (#12955 ) * fix BertWordPieceTokenizer constructor call * fix * Update website/docs/usage/linguistic-features.mdx --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-09-13 13:21:58 +02:00
Adriane Boyd	36d4767aca	Skip project remotes test for python 3.12 (#12980 ) `weasel` (using `cloudpathlib`) does not currently support remote paths for python 3.12.	2023-09-13 13:16:05 +02:00
Sofie Van Landeghem	013762be41	Few spacy-llm doc fixes (#12969 ) * fix construction example * shorten task-specific factory list * small edits to HF models * small edit to API models * typo * fix space Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 11:35:38 +02:00
Sofie Van Landeghem	def7013eec	Docs for spacy-llm 0.5.0 (#12968 ) * Update incorrect example config. (#12893) * spacy-llm docs cleanup (#12945) * Shorten NER section * fix template references * simplify sections * set temperature to 0.0 in examples * condense model information * fix parameters for REST models * set temperature to 0.0 * spelling fix * trigger preview * fix quotes * add small note on noop.v1 * move up example noop config * set appropriate model example configs * explain config * fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * Docs for ner.v3 and spancat.v3 spacy-llm tasks (#12949) * formatting * update usage table with NER.v3 * fix typo in links * v3 overview of parameters * add spancat.v3 * add further v3 explanations * remove TODO comment * few more small fixes * Add doc section on LLM + task factories (#12905) * Add section on LLM + task factories. * Apply suggestions from code review --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * add default config to openai models (#12961) * Docs for spacy-llm 0.5.0 (#12967) * simplify Python example * simplify Python example * Refer only to latest OpenAI model versions from usage doc * Typo fix Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * clarify accuracy claim --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-09-08 10:25:14 +02:00
Magdalena Aniol	cc78847688	fix training.batch_size example (#12963 )	2023-09-06 16:38:13 +02:00
Sofie Van Landeghem	6d1f6d9a23	Fix LLM usage example (#12950 ) * fix usage example * revert back to v2 to allow hot fix on main	2023-09-04 09:05:50 +02:00
Sofie Van Landeghem	5c1f9264c2	fix typo in link (#12948 ) * fix typo in link * fix REL.v1 parameter	2023-09-01 13:47:20 +02:00
David Berenstein	065ead4eed	updated `add_pipe` docs (#12947 )	2023-09-01 11:05:36 +02:00
vincent d warmerdam	3e4264899c	Update large-language-models.mdx (#12944 )	2023-08-30 11:58:14 +02:00
Ines Montani	52758e1afa	Add headers to netlify.toml [ci skip]	2023-08-30 11:55:23 +02:00
Vinit Ravishankar	c2303858e6	Documentation for spacy-curated-transformers (#12677 ) * initial * initial documentation run * fix typo * Remove mentions of Torchscript and quantization Both are disabled in the initial release of `spacy-curated-transformers`. * Fix `piece_encoder` entries * Remove `spacy-transformers`-specific warning * Fix duplicate entries in tables * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove type aliases * Fix copy-paste typo * Change `debug pieces` version tag to `3.7` * Set curated transformers API version to `3.7` * Fix transformer listener naming * Add docs for `init fill-config-transformer` * Update CLI command invocation syntax * Update intro section of the pipeline component docs * Fix source URL * Add a note to the architectures section about the `init fill-config-transformer` CLI command * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update CLI command name, args * Remove hyphen from the `curated-transformers.mdx` filename * Fix links * Remove placeholder text * Add text to the model/tokenizer loader sections * Fill in the `DocTransformerOutput` section * Formatting fixes * Add curated transformer page to API docs sidebar * More formatting fixes * Remove TODO comment * Remove outdated info about default config * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Add link to HF model hub * `prettier` --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-08-29 17:52:16 +02:00
PD Hall	d8a32c1050	docs: fix ngram_range_suggester max_size description (#12939 )	2023-08-29 11:10:58 +02:00
Sofie Van Landeghem	869cc4ab0b	warn when an unsupported/unknown key is given to the dependency matcher (#12928 )	2023-08-22 09:03:35 +02:00
Connor Brinton	6dd56868de	📝 Fix formula for receptive field in docs (#12918 ) SpaCy's HashEmbedCNN layer performs convolutions over tokens to produce contextualized embeddings using a `MaxoutWindowEncoder` layer. These convolutions are implemented using Thinc's `expand_window` layer, which concatenates `window_size` neighboring sequence items on either side of the sequence item being processed. This is repeated across `depth` convolutional layers. For example, consider the sequence "ABCDE" and a `MaxoutWindowEncoder` layer with a context window of 1 and a depth of 2. We'll focus on the token "C". We can visually represent the contextual embedding produced for "C" as: ```mermaid flowchart LR A0(A<sub>0</sub>) B0(B<sub>0</sub>) C0(C<sub>0</sub>) D0(D<sub>0</sub>) E0(E<sub>0</sub>) B1(B<sub>1</sub>) C1(C<sub>1</sub>) D1(D<sub>1</sub>) C2(C<sub>2</sub>) A0 --> B1 B0 --> B1 C0 --> B1 B0 --> C1 C0 --> C1 D0 --> C1 C0 --> D1 D0 --> D1 E0 --> D1 B1 --> C2 C1 --> C2 D1 --> C2 ``` Described in words, this graph shows that before the first layer of the convolution, the "receptive field" centered at each token consists only of that same token. That is to say, that we have a receptive field of 1. The first layer of the convolution adds one neighboring token on either side to the receptive field. Since this is done on both sides, the receptive field increases by 2, giving the first layer a receptive field of 3. The second layer of the convolutions adds an _additional_ neighboring token on either side to the receptive field, giving a final receptive field of 5. However, this doesn't match the formula currently given in the docs, which read: > The receptive field of the CNN will be > `depth * (window_size * 2 + 1)`, so a 4-layer network with a window > size of `2` will be sensitive to 20 words at a time. 
Substituting in our depth of 2 and window size of 1, this formula gives us a receptive field of: ``` depth * (window_size * 2 + 1) = 2 * (1 * 2 + 1) = 2 * (2 + 1) = 2 * 3 = 6 ``` This not only doesn't match our computations from above, it's also an even number! This is suspicious, since the receptive field is supposed to be centered on a token, and not between tokens. Generally, this formula results in an even number for any even value of `depth`. The error in this formula is that the adjustment for the center token is multiplied by the depth, when it should occur only once. The corrected formula, `depth * window_size * 2 + 1`, gives the correct value for our small example from above: ``` depth * window_size * 2 + 1 = 2 * 1 * 2 + 1 = 4 + 1 = 5 ``` These changes update the docs to correct the receptive field formula and the example receptive field size.	2023-08-21 10:52:32 +02:00
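The arithmetic in the commit message above can be checked with a short Python sketch. The function names here are hypothetical and are not part of spaCy or Thinc; the last function is a plain set-based simulation of stacking windowed layers, assumed for illustration rather than taken from the `expand_window` implementation.

```python
def corrected_receptive_field(depth: int, window_size: int) -> int:
    """Corrected formula from the commit: depth * window_size * 2 + 1."""
    return depth * window_size * 2 + 1


def old_receptive_field(depth: int, window_size: int) -> int:
    """Formula previously given in the docs: depth * (window_size * 2 + 1)."""
    return depth * (window_size * 2 + 1)


def simulated_receptive_field(depth: int, window_size: int) -> int:
    """Count the input positions that can influence one center token.

    Each layer lets a position see `window_size` neighbors on either side,
    so we widen the set of reachable positions once per layer.
    """
    length = 2 * depth * window_size + 11  # long enough to avoid edge effects
    center = length // 2
    reach = {center}
    for _ in range(depth):
        reach = {
            j
            for i in reach
            for j in range(i - window_size, i + window_size + 1)
            if 0 <= j < length
        }
    return len(reach)


# The "ABCDE" example: window size 1, depth 2, centered on "C".
print(corrected_receptive_field(2, 1))   # 5
print(old_receptive_field(2, 1))         # 6
print(simulated_receptive_field(2, 1))   # 5
```

The simulation agrees with the corrected formula and not with the old one. For the 4-layer, window size 2 case quoted from the docs, the corrected formula gives 4 * 2 * 2 + 1 = 17.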
Adriane Boyd	198488ee86	Extend to weasel v0.3 (#12908 ) * Extend to weasel v0.3 * Clean up unused imports in test_cli	2023-08-16 17:36:53 +02:00
Adriane Boyd	76a9f9c6c6	Docs: clarify abstract spacy.load examples (#12889 )	2023-08-16 17:28:34 +02:00
William Mattingly	64b8ee2dbe	Update universe.json (#12904 ) * Update universe.json added hobbit-spacy to the universe json * Update universe.json removed displacy from hobbit-spacy and added a default text.	2023-08-14 16:44:14 +02:00
denizcodeyaa	d50b8d51e2	Update examples.py (#12895 ) Add: example sentences to improve the Turkish model. Let's get the tr_web_core_sm out in the world yaa	2023-08-11 15:38:06 +02:00
Adriane Boyd	6a4aa43164	Extend to thinc v8.2 (#12897 )	2023-08-11 13:05:46 +02:00
Adriane Boyd	9622c11529	Extend to weasel v0.2 (#12902 )	2023-08-11 10:59:51 +02:00
Adriane Boyd	6ef29c4115	Merge pull request #12901 from adrianeboyd/feature/spacy-transformers-v1.3-revert Revert "Extend to spacy-transformers v1.3.x (#12877)"	2023-08-10 16:43:10 +02:00
Adriane Boyd	060241a8d5	Revert "Extend to spacy-transformers v1.3.x (#12877 )" This reverts commit `e5773e0c69`.	2023-08-10 11:42:09 +02:00
Adriane Boyd	1b2d66f98e	Switch zh tokenizer default pkuseg_model to spacy_ontonotes (#12896 ) So that users can use `copy_from_base_model` for other segmenters without having to override an irrelevant `pkuseg_model` setting, switch the default `pkuseg_model` to `spacy_ontonotes`.	2023-08-09 10:55:52 +02:00
Adriane Boyd	458bc5f45c	Set version to v3.6.1 (#12892 )	2023-08-08 15:04:13 +02:00
Adriane Boyd	c4e378df97	Update CuPy extras (#12890 ) * Add `cuda12x` for `cupy-cuda12x`. * Drop `cuda-autodetect` from quickstart, set default to `cuda11x` instead.	2023-08-08 12:58:28 +02:00
Adriane Boyd	245e2ddc25	Allow pydantic v2 using transitional v1 support (#12888 )	2023-08-08 11:27:28 +02:00
Adriane Boyd	45af8a5dcf	Update br tags (#12882 ) * Fix displacy br tag * Prefer <br>, also update package CLI	2023-08-04 10:52:41 +02:00
Sofie Van Landeghem	3b7faf4f5e	fix (#12881 )	2023-08-03 08:37:43 +02:00
Arman Mohammadi	07407e07ab	fix the regular expression matching on the full text (#12883 ) The regex pattern contained a mistake that prevented it from matching all the desired tokens. When a string literal uses the `r` prefix to mark it as raw, a backslash should be written once, not doubled.	2023-08-02 16:52:26 +02:00
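The raw-string pitfall described in the commit above can be reproduced with the standard `re` module. This is a minimal sketch: the sample text and the `\d+` pattern are assumptions chosen for illustration, not the pattern that the commit actually changed.

```python
import re

text = "The price is 42 dollars"

# In a raw string, one backslash already reaches the regex engine as a backslash,
# so this pattern means "one or more digits".
single = r"\d+"

# Doubling the backslash in a raw string asks the regex engine for a literal
# backslash followed by one or more "d" characters.
doubled = r"\\d+"

digits_found = re.findall(single, text)
doubled_found = re.findall(doubled, text)

print(digits_found)   # ['42']
print(doubled_found)  # []
```

The doubled form only matches text that contains an actual backslash, which is why the pattern silently missed the intended tokens.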
Adriane Boyd	e5773e0c69	Extend to spacy-transformers v1.3.x (#12877 )	2023-08-02 09:35:16 +02:00
Sofie Van Landeghem	0737443096	feat: add example stubs (3) (#12801 ) * feat: add example stubs * fix: add required annotations * fix: mypy issues * fix: use Py36-compatible Protocol * Minor reformatting * adding further type specifications and removing internal methods * black formatting * widen type to iterable * add private methods that are being used by the built-in converters * revert changes to corpus.py * fixes * fixes * fix typing of PlainTextCorpus --------- Co-authored-by: Basile Dura <basile@bdura.me> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-08-02 08:15:12 +02:00
Madeesh Kannan	222bd3c5b1	Display model's full base version string in incompatibility warning (#12857 )	2023-08-02 08:06:41 +02:00
Adriane Boyd	0fe43f40f1	Support registered vectors (#12492 ) * Support registered vectors * Format * Auto-fill [nlp] on load from config and from bytes/disk * Only auto-fill [nlp] * Undo all changes to Language.from_disk * Expand BaseVectors These methods are needed in various places for training and vector similarity. * isort * More linting * Only fill [nlp.vectors] * Update spacy/vocab.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Revert changes to test related to auto-filling [nlp] * Add vectors registry * Rephrase error about vocab methods for vectors * Switch to dummy implementation for BaseVectors.to_ops * Add initial draft of docs * Remove example from BaseVectors docs * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update website/docs/api/basevectors.mdx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix type and lint bpemb example * Update website/docs/api/basevectors.mdx --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-08-01 15:46:08 +02:00
Paul O'Leary McCann	b4e457d9fe	Accept multiple code files in all CLI commands (#12101 ) * Add support for multiple code files to all relevant commands Prior to this, only the package command supported multiple code files. * Update docs * Add debug data test, plus generic fixtures One tricky thing here: it's tempting to create the config by creating a pipeline in code, but that requires declaring the custom components here. However the CliRunner appears to be run in the same process or otherwise have access to our registry, so it works even without any code arguments. So it's necessary to avoid declaring the components in the tests. * Add debug config test and restructure The code argument imports the provided file. If it adds items to the registry, that affects global state, which CliRunner doesn't isolate. Since there's no standard way to remove things from the registry, this instead uses subprocess.run to run commands. * Use a more generic, parametrized test * Add output arg for assemble and pretrain Assemble and pretrain require an output argument. This commit adds assemble testing, but not pretrain, as that requires an actual trainable component, which is not currently in the test config. * Add evaluate test and some cleanup * Mark tests as slow * Revert argument name change * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Format API CLI docs * isort * Fix imports in tests * isort * Undo changes to package CLI help * Fix python executable and lang code in test * Fix executable in another test --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-08-01 15:24:02 +02:00
Adriane Boyd	2702db9fef	Recommend lookups tables from URLs or other loaders (#12283 ) * Recommend lookups tables from URLs or other loaders Shift away from the `lookups` extra (which isn't removed, just no longer mentioned) and recommend loading data from the `spacy-lookups-data` repo or other sources rather than the `spacy-lookups-data` package. If the tables can't be loaded from the `lookups` registry in the lemmatizer, show how to specify the tables in `[initialize]` rather than recommending the `spacy-lookups-data` package. * Add tests for some rule-based lemmatizers * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-07-31 15:54:35 +02:00
Peter Baumgartner	a0a195688f	Tests for CLI app - `init config` generates `train`-able config (#12173 ) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unnecessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 14:45:04 +02:00
Andy Friedman	186889ec9c	added entry for SaysWho (#12828 ) * Update universe.json added entry for Sayswho * Update universe.json updated sayswho entry * Update universe.json * Update website/meta/universe.json * Update website/meta/universe.json --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-31 10:52:32 +02:00
Sofie Van Landeghem	c9e9dccf79	Add displaCy data structures to docs (2) (#12875 ) * Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2023-07-31 10:47:57 +02:00
Victoria	49055ed7c8	Add cli for finding locations of registered func (#12757 ) * Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 09:39:00 +02:00
Adriane Boyd	9ffa5d8a15	Remove ray extra (#12870 )	2023-07-28 15:48:36 +02:00
Márton Kardos	51b9655470	Added OdyCy to spaCy Universe (#12826 ) * Added OdyCy to spaCy Universe * Replaced template tags Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-26 16:05:53 +02:00
Madeesh Kannan	98799d849e	`SpanCat`: Remove invalid `threshold` config argument (#12860 )	2023-07-26 13:56:31 +02:00
Adriane Boyd	f8f489bcd6	Switch from distutils to setuptools/sysconfig (#12853 ) Additionally remove outdated `is_new_osx` check and settings.	2023-07-24 16:58:27 +02:00
Victoria	e2b89012a2	Add spacy-llm docs to website (#12782 ) * initial commit * update for v0.4.0 * Apply suggestions from code review * Fix formatting * Apply suggestions from code review * Update website/docs/api/large-language-models.mdx * Update website/docs/api/large-language-models.mdx * update usage page * Apply suggestions from review * Apply suggestions from review * fix links * fix relative links * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Apply suggestions from review * Add section on Llama 2. Format. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-07-24 14:44:47 +02:00
Sofie Van Landeghem	eaaac5a08c	Merge pull request #12842 from svlandeg/sync_v4 Sync v4 with latest from master and develop	2023-07-24 12:13:04 +02:00
Adriane Boyd	1d216a7ea6	Update README for v3.6 (#12844 ) * Update most recent release * Switch from azure to GHA CI tests badge * Remove link to survey * Format	2023-07-24 10:41:04 +02:00
Adriane Boyd	5888afa884	Update numpy build constraints for numpy 1.25 (#12839 ) * Update numpy build constraints for numpy 1.25 Starting in numpy 1.25 (see https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is backwards-compatible by default. For python 3.9+, we should be able to drop the specific numpy build requirements and use `numpy>=1.25`, which is currently backwards-compatible to `numpy>=1.19`. In the future, the python <3.9 requirements could be dropped and the lower numpy pin could correspond to the oldest supported version for the current lower python pin. * Turn off fail-fast * Revert "Turn off fail-fast" This reverts commit `4306f516bc`. * Update for python 3.6 * Fix typo	2023-07-24 10:32:56 +02:00
Sofie Van Landeghem	f293386d3e	remove unnecessary line Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-07-20 14:08:29 +02:00


16294 Commits