spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-26 10:14:07 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	1b5aba9e22	Don't re-download installed models (#12188 ) * Don't re-download installed models When downloading a model, this checks if the same version of the same model is already installed. If it is then the download is skipped. This is necessary because pip uses the final download URL for its caching feature, but because of the way models are hosted on Github, their URLs change every few minutes. * Use importlib instead of meta.json * Use get_package_version * Add untested, disabled test --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-31 11:31:17 +01:00
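The version check described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the code from the commit: `installed_version`, `should_download`, and the `lookup` parameter are hypothetical names, and the sketch assumes the model is installed as a regular Python package whose metadata `importlib.metadata` can read.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def should_download(
    package: str,
    wanted_version: str,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> bool:
    """Download only when the exact wanted version is not already installed.

    `lookup` is a hypothetical hook so the decision can be tested without
    touching the real environment.
    """
    return lookup(package) != wanted_version
```

Skipping the download on an exact version match sidesteps pip's URL-keyed cache, which the commit message says is defeated by the changing download URLs.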
Daniël de Kok	6b07be2110	Add `Language.distill` (#12116 ) * Add `Language.distill` This method is the distillation counterpart of `Language.update`. It takes a teacher `Language` instance and distills the student pipes on the teacher pipes. * Apply suggestions from code review Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Clarify that how Example is used in distillation * Update transition parser distill docstring for examples argument * Pass optimizer to `TrainablePipe.distill` * Annotate pipe before update As discussed internally, we want to let a pipe annotate before doing an update with gold/silver data. Otherwise, the output may be (too) informed by the gold/silver data. * Rename `component_map` to `student_to_teacher` * Better synopsis in `Language.distill` docstring * `name` -> `student_name` * Fix labels type in docstring * Mark distill test as slow * Fix `student_to_teacher` type in docs --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2023-01-30 12:44:11 +01:00
Adriane Boyd	ec45f704b1	Drop python 3.6/3.7, remove unneeded compat (#12187 ) * Drop python 3.6/3.7, remove unneeded compat * Remove unused import * Minimal python 3.8+ docs updates	2023-01-27 15:48:20 +01:00
Sofie Van Landeghem	1678a98449	Merge pull request #12192 from adrianeboyd/chore/update-v4-from-master-5 Update v4 from master, format, update CI	2023-01-27 14:59:26 +01:00
Adriane Boyd	16609517f1	CI: Skip tests that require published pipelines	2023-01-27 08:37:02 +01:00
Adriane Boyd	fd911fe2af	Format	2023-01-27 08:29:46 +01:00
Adriane Boyd	8548d4d16e	Merge remote-tracking branch 'upstream/master' into update-v4-from-master-1	2023-01-27 08:29:09 +01:00
Peter Baumgartner	c68e6b8a96	`trainable_lemmatizer` in `debug data` (#11419 ) * WIP * rm ipython embeds * rm total * WIP * cleanup * cleanup + reword * rm component function * remove migration support form * fix reference dataset for dev data * additional fixes - set approach to identifying unique trees - adjust line length on messages - add logic for detecting docs without annotations * use 0 instead of none for no annotation * partial annotation support * initial tests for _compile_gold lemma attributes Using the example data from the edit tree lemmatizer tests for: - lemmatizer_trees - partial_lemma_annotations - n_low_cardinality_lemmas - no_lemma_annotations * adds output test for cli app * switch msg level * rm unclear uniqueness check * Revert "rm unclear uniqueness check" This reverts commit `6ea2b3524b`. * remove good message on uniqueness * formatting * use en_vocab fixture * clarify data set source in messages * remove unnecessary import Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-26 17:36:50 +01:00
Daniël de Kok	8d69874afb	Add `spacy.PlainTextCorpusReader.v1` (#12122 ) * Add `spacy.PlainTextCorpusReader.v1` This is a corpus reader that reads plain text corpora with the following format: - UTF-8 encoding - One line per document. - Blank lines are ignored. It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum. * Update spacy/training/corpus.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * docs: add version to `PlainTextCorpus` * Add docstring to registry function * Add plain text corpus tests * Only strip newline/carriage return * Add return type _string_to_tmp_file helper * Use a temporary directory in place of file name Different OS auto delete/sharing semantics are just wonky. * This will be new in 3.5.1 (rather than 4) * Test improvements from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-26 11:33:22 +01:00
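The plain text format described in this commit (UTF-8, one document per line, blank lines ignored, only newline and carriage return stripped) can be sketched as follows. This is an illustration of the format rules, not the `spacy.PlainTextCorpusReader.v1` implementation: `iter_documents` and `read_plain_text_corpus` are hypothetical names, and the sketch assumes a line containing only spaces still counts as a document, since only `\r` and `\n` are stripped.

```python
from pathlib import Path
from typing import Iterable, Iterator, Union


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one document text per non-blank line."""
    for line in lines:
        # Strip only the line terminator, so leading and trailing
        # spaces inside the document are preserved.
        text = line.rstrip("\r\n")
        if text:
            yield text


def read_plain_text_corpus(path: Union[str, Path]) -> Iterator[str]:
    """Stream documents from a UTF-8 file without loading it fully."""
    # newline="" keeps "\r\n" intact so the stripping above is explicit.
    with open(path, encoding="utf8", newline="") as file_:
        yield from iter_documents(file_)
```

Streaming line by line keeps memory use flat, which matches the commit's motivation of handling very large corpora without a serialized format.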
Marcus Blättermann	a37117abd0	Fix text colors in docs (#12186 )	2023-01-26 10:30:24 +01:00
Marcus Blättermann	056b73468c	Load components dynamically (decrease initial file size for docs) (#12175 ) * Extract `CodeBlock` component into own file * Extract `InlineCode` component into own file * Extract `TypeAnnotation` component into own file * Convert named `export` to `default export` * Remove unused `export` * Simplify `TypeAnnotation` to remove dependency for Prism * Load `Code` component dynamically * Extract `MarkdownToReact` component into own file * WIP Code Dynamic * Load `MarkdownToReact` component dynamically * Extract `htmlToReact` to own file * Load `htmlToReact` component dynamically * Dynamically load `Juniper`	2023-01-25 17:30:41 +01:00
Adriane Boyd	07dfa54669	CI: Extend website excludes (#12185 )	2023-01-25 15:35:17 +01:00
Marcus Blättermann	11f10fff60	Fix frontpage image (#12184 )	2023-01-25 13:17:35 +01:00
Marcus Blättermann	5a6000fb8b	Fix text color in docs (#12183 ) * Fix text color on landing page * Fix code color	2023-01-25 13:14:32 +01:00
Adriane Boyd	8ea15240ca	Update binder version to v3.5 (#12153 )	2023-01-25 13:14:23 +01:00
Adriane Boyd	2dbb764183	CI: Add black formatting check to validation (#12182 )	2023-01-25 12:51:37 +01:00
Marcus Blättermann	99a05734a8	Add `aria-label` to quickstart widget (#12179 )	2023-01-25 11:46:55 +01:00
Marcus Blättermann	0298b1a863	WEB-28 Increase contrast of grey text (#12178 ) * Use transparent colors to increase contrast on darker backgrounds * Increase color contrast of grey text	2023-01-25 11:46:43 +01:00
Marcus Blättermann	3062fae2ca	Fix broken URL (#12176 )	2023-01-25 11:42:19 +01:00
Marcus Blättermann	57ba37bc52	Fix regression with links in prompts (#12172 )	2023-01-25 08:51:40 +01:00
Marcus Blättermann	05a3685849	Fix broken syntax for type annotations (#12171 )	2023-01-25 08:51:25 +01:00
Paul O'Leary McCann	de360bc981	Refactor lexeme mem passing (#12125 ) * Don't pass mem pool to new lexeme function * Remove unused mem from function args Two methods calling _new_lexeme, get and get_by_orth, took mem arguments just to call the internal method. That's no longer necessary, so this cleans it up. * prettier formatting * Remove more unused mem args	2023-01-25 12:50:21 +09:00
Marcus Blättermann	f3c586f74a	Fix navigation alert (#12169 ) Fixes a regression introduced in #12163	2023-01-24 16:40:40 +01:00
Marcus Blättermann	49237f05a6	Fix `aria-hidden` element (#12163 ) * Rename CSS class to make use more clear * Rename component prop to improve code readability * Fix `aria-hidden` directly on a link element This link wouldn't have been clickable by screenreaders * Refactor component This removes an unnecessary `div` and a duplicate link Co-authored-by: Ines Montani <ines@ines.io>	2023-01-24 14:44:47 +01:00
Marcus Blättermann	0a70696923	Fix wrong HTML element attribute (#12151 ) Originally introduced in `62b9c9c6d7` Original error: Warning: Invalid DOM property `class`. Did you mean `className`? React doesn't have `class`, it uses `className`.	2023-01-24 14:35:31 +01:00
Marcus Blättermann	9555e7aecf	Remove unnecessary links (#12159 ) There is no need to link to the image we are already viewing and this is also considered an accessibility issue.	2023-01-24 14:01:00 +01:00
Marcus Blättermann	031f6c7b60	WEB-27 Add `alt` tags to images (#12166 ) * Update spaCy badge `alt` text * Add `next/image` component to Universe * Add missing `alt`texts	2023-01-24 13:56:14 +01:00
Marcus Blättermann	c9beb47ab7	Increase contrast of text and theme color (#12165 )	2023-01-24 13:55:20 +01:00
Marcus Blättermann	a7d6a62f7c	Remove zoom locking (#12164 ) * Fix missing comma * Activate user zoom for website This is recommended by lighthouse: > Disabling zooming is problematic for users with low vision who rely on screen magnification to properly see the contents of a web page. Learn more. Also iOS already ignores this attribute anyway.	2023-01-24 13:54:49 +01:00
Marcus Blättermann	48159e1d60	Update explosion logo (#12162 ) This fixes a misalignment of the explosion logo	2023-01-24 13:53:51 +01:00
Marcus Blättermann	7160f7835d	Fix GitHub badge (#12161 ) * Extract component * Remove rounded border from GitHub Stars badge * Add `alt` text	2023-01-24 13:53:28 +01:00
Marcus Blättermann	3aa61e615f	Add missing label (#12160 )	2023-01-24 13:52:55 +01:00
Marcus Blättermann	fcedcd54a8	WEB-30 spaCy pattern in `.png` (#12158 ) * Fix gap in landing pattern at the top * Replace `.jpg` patterns with `.png` This drastically reduces file size (for the landing page from 221kb to 57kb) while doubling the resolution to look sharper on retina displays.	2023-01-24 13:51:39 +01:00
Sofie Van Landeghem	de1fe8dce3	Fix Azure ignoring website files (#12129 ) * ignore all mdx files and all files in website * have both .md and .mdx * exclude everything but universe.json	2023-01-24 10:02:07 +01:00
Edward	e9048fd4a1	Add how to load probability tables to existing models to spaCy docs (#12051 ) * add section about adding tables to models * change to lexeme_norm * Change syntax * change to _prob * Update website/docs/usage/saving-loading.mdx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-01-24 10:01:22 +01:00
Raphael Mitsch	950fceceb6	Make test_cli_find_threshold() more robust. (#12148 )	2023-01-23 14:42:33 +01:00
Richard Hudson	f9e020dd67	Fix speed problem with `top_k>1` on CPU in edit tree lemmatizer (#12017 ) * Refactor _scores2guesses * Handle arrays on GPU * Convert argmax result to raw integer Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Use NumpyOps() to copy data to CPU Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Changes based on review comments * Use different _scores2guesses depending on tree_k * Add tests for corner cases * Add empty line for consistency * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> * Improve naming Co-authored-by: Daniël de Kok <me@github.danieldk.eu> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2023-01-20 19:34:11 +01:00
Marcus Blättermann	8a3ca77d9e	Fix broken social media image (#12137 )	2023-01-20 16:57:43 +01:00
Adriane Boyd	dec81508d2	Update README for v3.5 (#12132 )	2023-01-19 16:13:41 +01:00
Sofie Van Landeghem	0f5d8a27f2	3.5 usage page (#12057 ) * skeleton * Fill in non-CLI details from release notes draft * Add TODO for fuzzy matching * Website updates for v3-5 draft * Fill in usage examples * Add fuzzy matching to intro * Fix fuzzy examples * Shell example formatting * Fix typo * Format * Remove trailing periods in internal list * Update * Fix spacing for nested lists * Update InMemoryLookupKB link Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>	2023-01-19 16:13:04 +01:00
Adriane Boyd	1e993d3b03	Merge pull request #12121 from adrianeboyd/chore/v3.5.0-2 Revert "Temporarily skip tests that require models/compat"	2023-01-19 15:59:30 +01:00
Adriane Boyd	3b8918e166	API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar (#12128 ) * API docs: Rename kb_in_memory to inmemorylookupkb, add to sidebar * adjust to mdx * linkout to InMemoryLookupKB at first occurrence in kb.mdx * fix links to docs * revert Azure trigger setting (I'll make a separate PR) Co-authored-by: svlandeg <svlandeg@github.com>	2023-01-19 13:29:17 +01:00
Adriane Boyd	a9910b6081	Update years in website landing page (#12107 ) * Update years in website landing page * Update website/pages/index.tsx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-01-19 11:08:02 +01:00
Sofie Van Landeghem	7d88c55eeb	update docs for apply (#12127 ) * update docs for apply * prettier	2023-01-19 10:37:09 +01:00
Daniël de Kok	6348a7a4b4	Set version to v4.0.0.dev0 (#12126 )	2023-01-19 09:25:34 +01:00
Adriane Boyd	28fd589b85	Move all website gitignore settings to website/.gitignore (#12120 )	2023-01-18 21:46:19 +01:00
Daniël de Kok	b052b1b47f	Fix batching regression (#12094 ) * Fix batching regression Some time ago, the spaCy v4 branch switched to the new Thinc v9 schedule. However, this introduced an error in how batching is handled. In the PR, the batchers were changed to keep track of their step, so that the step can be passed to the schedule. However, the issue is that the training loop repeatedly calls the batching functions (rather than using an infinite generator/iterator). So, the step and therefore the schedule would be reset each epoch. Before the schedule switch we didn't have this issue, because the old schedules were stateful. This PR fixes this issue by reverting the batching functions to use a (stateful) generator. Their registry functions do accept a `Schedule` and we convert `Schedule`s to generators. * Update batcher docs * Docstring fixes * Make minibatch take iterables again as well * Bump thinc requirement to 9.0.0.dev2 * Use type declaration * Convert another comment into a proper type declaration	2023-01-18 18:28:30 +01:00
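The regression described in this commit can be reproduced in miniature. The sketch below is an illustration only and does not use the Thinc or spaCy APIs: `compounding`, `schedule_to_generator`, `minibatch_stateless`, and `minibatch_stateful` are hypothetical names. It assumes a schedule is a pure function of the step, and that the training loop calls the batcher once per epoch.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, List


def compounding(step: int) -> int:
    """Hypothetical stateless schedule: batch size grows with the step."""
    return min(1 + step, 4)


def schedule_to_generator(schedule: Callable[[int], int]) -> Iterator[int]:
    """Wrap a stateless schedule in a generator that remembers its step."""
    for step in count():
        yield schedule(step)


def minibatch_stateless(
    items: Iterable[int], schedule: Callable[[int], int]
) -> Iterator[List[int]]:
    """Buggy variant: the step is local, so it restarts at 0 on every call."""
    iterator = iter(items)
    step = 0
    while True:
        batch = list(islice(iterator, schedule(step)))
        if not batch:
            return
        yield batch
        step += 1


def minibatch_stateful(
    items: Iterable[int], sizes: Iterator[int]
) -> Iterator[List[int]]:
    """Fixed variant: sizes come from a generator shared across calls."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, next(sizes)))
        if not batch:
            return
        yield batch


sizes = schedule_to_generator(compounding)
for epoch in range(2):
    stateless = [len(b) for b in minibatch_stateless(range(3), compounding)]
    stateful = [len(b) for b in minibatch_stateful(range(3), sizes)]
    print(epoch, stateless, stateful)
```

With the stateless variant every epoch starts again from the smallest batch size, while the shared generator lets the schedule keep advancing across epochs, which is the behaviour the commit restores.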
Daniël de Kok	668ec989ad	Update Dockerfile to work with Next.js (#12119 ) * Update Dockerfile to work with Next.js - Update to Node 18 - Do not run as root, this also works better with Node privilege-dropping. - Update README with new run instructions and adding the `--rm` flag to avoid leaving a bunch of unused Docker containers. - Also change README to recommend building the image locally. Image builds are pretty fast and the uploaded images get outdated pretty quickly. * Add .dockerignore to avoid sending large build contexts * Typo	2023-01-18 18:15:47 +01:00
Adriane Boyd	dc0f527039	Revert "Temporarily skip tests that require models/compat" This reverts commit `378db0eb1e`.	2023-01-18 12:54:56 +01:00
Daniël de Kok	a183db3cef	Merge the parser refactor into `v4` (#10940 ) * Try to fix doc.copy * Set dev version * Make vocab always own lexemes * Change version * Add SpanGroups.copy method * Fix set_annotations during Parser.update * Fix dict proxy copy * Upd version * Fix copying SpanGroups * Fix set_annotations in parser.update * Fix parser set_annotations during update * Revert "Fix parser set_annotations during update" This reverts commit `eb138c89ed`. * Revert "Fix set_annotations in parser.update" This reverts commit `c6df0eafd0`. * Fix set_annotations during parser update * Inc version * Handle final states in get_oracle_sequence * Inc version * Try to fix parser training * Inc version * Fix * Inc version * Fix parser oracle * Inc version * Inc version * Fix transition has_gold * Inc version * Try to use real histories, not oracle * Inc version * Upd parser * Inc version * WIP on rewrite parser * WIP refactor parser * New progress on parser model refactor * Prepare to remove parser_model.pyx * Convert parser from cdef class * Delete spacy.ml.parser_model * Delete _precomputable_affine module * Wire up tb_framework to new parser model * Wire up parser model * Uncython ner.pyx and dep_parser.pyx * Uncython * Work on parser model * Support unseen_classes in parser model * Support unseen classes in parser * Cleaner handling of unseen classes * Work through tests * Keep working through errors * Keep working through errors * Work on parser. 15 tests failing * Xfail beam stuff. 9 failures * More xfail. 7 failures * Xfail. 6 failures * cleanup * formatting * fixes * pass nO through * Fix empty doc in update * Hackishly fix resizing. 3 failures * Fix redundant test. 
2 failures * Add reference version * black formatting * Get tests passing with reference implementation * Fix missing prints * Add missing file * Improve indexing on reference implementation * Get non-reference forward func working * Start rigging beam back up * removing redundant tests, cf #8106 * black formatting * temporarily xfailing issue 4314 * make flake8 happy again * mypy fixes * ensure labels are added upon predict * cleanup remnants from merge conflicts * Improve unseen label masking Two changes to speed up masking by ~10%: - Use a bool array rather than an array of float32. - Let the mask indicate whether a label was seen, rather than unseen. The mask is most frequently used to index scores for seen labels. However, since the mask marked unseen labels, this required computing an intermittent flipped mask. * Write moves costs directly into numpy array (#10163) This avoids elementwise indexing and the allocation of an additional array. Gives a ~15% speed improvement when using batch_by_sequence with size 32. * Temporarily disable ner and rehearse tests Until rehearse is implemented again in the refactored parser. * Fix loss serialization issue (#10600) * Fix loss serialization issue Serialization of a model fails with: TypeError: array(738.3855, dtype=float32) is not JSON serializable Fix this using float conversion. * Disable CI steps that require spacy.TransitionBasedParser.v2 After finishing the refactor, TransitionBasedParser.v2 should be provided for backwards compat. * Add back support for beam parsing to the refactored parser (#10633) * Add back support for beam parsing Beam parsing was already implemented as part of the `BeamBatch` class. This change makes its counterpart `GreedyBatch`. Both classes are hooked up in `TransitionModel`, selecting `GreedyBatch` when the beam size is one, or `BeamBatch` otherwise. 
* Use kwarg for beam width Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Avoid implicit default for beam_width and beam_density * Parser.{beam,greedy}_parse: ensure labels are added * Remove 'deprecated' comments Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Parser `StateC` optimizations (#10746) * `StateC`: Optimizations Avoid GIL acquisition in `__init__` Increase default buffer capacities on init Reduce C++ exception overhead * Fix typo * Replace `set::count` with `set::find` * Add exception attribute to c'tor * Remove unused import * Use a power-of-two value for initial capacity Use default-insert to init `_heads` and `_unshiftable` * Merge `cdef` variable declarations and assignments * Vectorize `example.get_aligned_parses` (#10789) * `example`: Vectorize `get_aligned_parse` Rename `numpy` import * Convert aligned array to lists before returning * Revert import renaming * Elide slice arguments when selecting the entire range * Tagger/morphologizer alignment performance optimizations (#10798) * `example`: Unwrap `numpy` scalar arrays before passing them to `StringStore.__getitem__` * `AlignmentArray`: Use native list as staging buffer for offset calculation * `example`: Vectorize `get_aligned` * Hoist inner functions out of `get_aligned` * Replace inline `if..else` clause in assignment statement * `AlignmentArray`: Use raw indexing into offset and data `numpy` arrays * `example`: Replace array unique value check with `groupby` * `example`: Correctly exclude tokens with no alignment in `_get_aligned_vectorized` Simplify `_get_aligned_non_vectorized` * `util`: Update `all_equal` docstring * Explicitly use `int32_t` Restore C CPU inference in the refactored parser (#10747) * Bring back the C parsing model The C parsing model is used for CPU inference and is still faster for CPU inference than the forward pass of the Thinc model. 
* Use C sgemm provided by the Ops implementation * Make tb_framework module Cython, merge in C forward implementation * TransitionModel: raise in backprop returned from forward_cpu * Re-enable greedy parse test * Return transition scores when forward_cpu is used * Apply suggestions from code review Import `Model` from `thinc.api` Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Use relative imports in tb_framework * Don't assume a default for beam_width * We don't have a direct dependency on BLIS anymore * Rename forwards to _forward_{fallback,greedy_cpu} * Require thinc >=8.1.0,<8.2.0 * tb_framework: clean up imports * Fix return type of _get_seen_mask * Move up _forward_greedy_cpu * Style fixes. * Lower thinc lowerbound to 8.1.0.dev0 * Formatting fix Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Reimplement parser rehearsal function (#10878) * Reimplement parser rehearsal function Before the parser refactor, rehearsal was driven by a loop in the `rehearse` method itself. For each parsing step, the loops would: 1. Get the predictions of the teacher. 2. Get the predictions and backprop function of the student. 3. Compute the loss and backprop into the student. 4. Move the teacher and student forward with the predictions of the student. In the refactored parser, we cannot perform search stepwise rehearsal anymore, since the model now predicts all parsing steps at once. Therefore, rehearsal is performed in the following steps: 1. Get the predictions of all parsing steps from the student, along with its backprop function. 2. Get the predictions from the teacher, but use the predictions of the student to advance the parser while doing so. 3. Compute the loss and backprop into the student. 
To support the second step a new method, `advance_with_actions` is added to `GreedyBatch`, which performs the provided parsing steps. * tb_framework: wrap upper_W and upper_b in Linear Thinc's Optimizer cannot handle resizing of existing parameters. Until it does, we work around this by wrapping the weights/biases of the upper layer of the parser model in Linear. When the upper layer is resized, we copy over the existing parameters into a new Linear instance. This does not trigger an error in Optimizer, because it sees the resized layer as a new set of parameters. * Add test for TransitionSystem.apply_actions * Better FIXME marker Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Fixes from Madeesh * Apply suggestions from Sofie Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Remove useless assignment Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Rename some identifiers in the parser refactor (#10935) * Rename _parseC to _parse_batch * tb_framework: prefix many auxiliary functions with underscore To clearly state the intent that they are private. * Rename `lower` to `hidden`, `upper` to `output` * Parser slow test fixup We don't have TransitionBasedParser.{v1,v2} until we bring it back as a legacy option. * Remove last vestiges of PrecomputableAffine This does not exist anymore as a separate layer. * ner: re-enable sentence boundary checks * Re-enable test that works now. * test_ner: make loss test more strict again * Remove commented line * Re-enable some more beam parser tests * Remove unused _forward_reference function * Update for CBlas changes in Thinc 8.1.0.dev2 Bump thinc dependency to 8.1.0.dev3. * Remove references to spacy.TransitionBasedParser.{v1,v2} Since they will not be offered starting with spaCy v4. 
* `tb_framework`: Replace references to `thinc.backends.linalg` with `CBlas` * dont use get_array_module (#11056) (#11293) Co-authored-by: kadarakos <kadar.akos@gmail.com> * Move `thinc.extra.search` to `spacy.pipeline._parser_internals` (#11317) * `search`: Move from `thinc.extra.search` Fix NPE in `Beam.__dealloc__` * `pytest`: Add support for executing Cython tests Move `search` tests from thinc and patch them to run with `pytest` * `mypy` fix * Update comment * `conftest`: Expose `register_cython_tests` * Remove unused import * Move `argmax` impls to new `_parser_utils` Cython module (#11410) * Parser does not have to be a cdef class anymore This also fixes validation of the initialization schema. * Add back spacy.TransitionBasedParser.v2 * Fix a rename that was missed in #10878. So that rehearsal tests pass. * Remove module from setup.py that got added during the merge * Bring back support for `update_with_oracle_cut_size` (#12086) * Bring back support for `update_with_oracle_cut_size` This option was available in the pre-refactor parser, but was never implemented in the refactored parser. This option cuts transition sequences that are longer than `update_with_oracle_cut` size into separate sequences that have at most `update_with_oracle_cut` transitions. The oracle (gold standard) transition sequence is used to determine the cuts and the initial states for the additional sequences. Applying this cut makes the batches more homogeneous in the transition sequence lengths, making forward passes (and as a consequence training) much faster. Training time 1000 steps on de_core_news_lg: - Before this change: 149s - After this change: 68s - Pre-refactor parser: 81s * Fix a rename that was missed in #10878. So that rehearsal tests pass. * Apply suggestions from @shadeMe * Use chained conditional * Test with update_with_oracle_cut_size={0, 1, 5, 100} And fix a git that occurs with a cut size of 1. 
* Fix up some merge fall out * Update parser distillation for the refactor In the old parser, we'd iterate over the transitions in the distill function and compute the loss/gradients on the go. In the refactored parser, we first let the student model parse the inputs. Then we'll let the teacher compute the transition probabilities of the states in the student's transition sequence. We can then compute the gradients of the student given the teacher. * Add back spacy.TransitionBasedParser.v1 references - Accordion in the architecture docs. - Test in test_parse, but disabled until we have a spacy-legacy release. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: kadarakos <kadar.akos@gmail.com>	2023-01-18 11:27:45 +01:00
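The cutting step behind `update_with_oracle_cut_size` in the commit above can be sketched as plain chunking of an oracle transition sequence. This is a simplified illustration, not the parser's implementation: `cut_oracle_sequence` is a hypothetical name, the sketch assumes a cut size below 1 disables cutting, and it omits the replay of the oracle that the real parser needs to obtain the initial state for each additional sequence.

```python
from typing import List, Sequence, TypeVar

Action = TypeVar("Action")


def cut_oracle_sequence(
    actions: Sequence[Action], cut_size: int
) -> List[List[Action]]:
    """Split a transition sequence into pieces of at most `cut_size` actions."""
    if cut_size < 1:
        # Assumption: a cut size of 0 means "do not cut".
        return [list(actions)]
    return [
        list(actions[start : start + cut_size])
        for start in range(0, len(actions), cut_size)
    ]


# Hypothetical oracle sequence of shift/left-arc/right-arc actions.
oracle = ["S", "S", "L", "R", "S", "R", "R"]
for piece in cut_oracle_sequence(oracle, 3):
    print(piece)
```

Pieces of bounded length make the transition sequences in a batch more uniform, which is the reason the commit gives for the faster forward passes.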


15886 Commits