spaCy/website/docs/usage/embeddings-transformers.mdx

983 lines
41 KiB
Plaintext
Raw Normal View History

2020-08-18 01:49:19 +03:00
---
title: Embeddings, Transformers and Transfer Learning
teaser: Using transformer embeddings like BERT in spaCy
menu:
- ['Embedding Layers', 'embedding-layers']
- ['Transformers', 'transformers']
- ['Static Vectors', 'static-vectors']
- ['Pretraining', 'pretraining']
next: /usage/training
---
2020-08-22 17:47:03 +03:00
spaCy supports a number of **transfer and multi-task learning** workflows that
can often help improve your pipeline's efficiency or accuracy. Transfer learning
2020-08-22 16:41:35 +03:00
refers to techniques such as word vector tables and language model pretraining.
These techniques can be used to import knowledge from raw text into your
2020-08-22 17:47:03 +03:00
pipeline, so that your models are able to generalize better from your annotated
examples.
You can convert **word vectors** from popular tools like
[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim),
or you can load in any pretrained **transformer model** if you install
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can
also do your own language model pretraining via the
[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your
transformer or other contextual embedding model across multiple components,
which can make long pipelines several times more efficient. To use transfer
learning, you'll need at least a few annotated examples for what you're trying
to predict. Otherwise, you could try using a "one-shot learning" approach using
2020-08-22 16:41:35 +03:00
[vectors and similarity](/usage/linguistic-features#vectors-similarity).
2020-08-21 14:22:59 +03:00
2020-08-18 01:49:19 +03:00
<Accordion title="Whats the difference between word vectors and language models?" id="vectors-vs-language-models">
2020-09-17 17:42:53 +03:00
[Transformers](#transformers) are large and powerful neural networks that give
2020-09-17 18:12:51 +03:00
you better accuracy, but are harder to deploy in production, as they require a
GPU to run effectively. [Word vectors](#word-vectors) are a slightly older
technique that can give your models a smaller improvement in accuracy, and can
also provide some additional capabilities.
The key difference between word-vectors and contextual language models such as
transformers is that word vectors model **lexical types**, rather than _tokens_.
If you have a list of terms with no context around them, a transformer model
like BERT can't really help you. BERT is designed to understand language **in
context**, which isn't what you have. A word vectors table will be a much better
fit for your task. However, if you do have words in context whole sentences or
paragraphs of running text word vectors will only provide a very rough
2020-09-17 17:42:53 +03:00
approximation of what the text is about.
2020-08-18 01:49:19 +03:00
Word vectors are also very computationally efficient, as they map a word to a
vector with a single indexing operation. Word vectors are therefore useful as a
way to **improve the accuracy** of neural network models, especially models that
are small or have received little or no pretraining. In spaCy, word vector
tables are only used as **static features**. spaCy does not backpropagate
gradients to the pretrained word vectors table. The static vectors table is
usually used in combination with a smaller table of learned task-specific
embeddings.
</Accordion>
<Accordion title="When should I add word vectors to my model?">
Word vectors are not compatible with most [transformer models](#transformers),
but if you're training another type of NLP network, it's almost always worth
adding word vectors to your model. As well as improving your final accuracy,
word vectors often make experiments more consistent, as the accuracy you reach
will be less sensitive to how the network is randomly initialized. High variance
due to random chance can slow down your progress significantly, as you need to
run many experiments to filter the signal from the noise.
Word vector features need to be enabled prior to training, and the same word
vectors table will need to be available at runtime as well. You cannot add word
vector features once the model has already been trained, and you usually cannot
replace one word vectors table with another without causing a significant loss
of performance.
</Accordion>
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
## Shared embedding layers {id="embedding-layers"}
2020-08-18 15:39:40 +03:00
2020-08-22 18:11:28 +03:00
spaCy lets you share a single transformer or other token-to-vector ("tok2vec")
2020-08-22 18:15:05 +03:00
embedding layer between multiple components. You can even update the shared
layer, performing **multi-task learning**. Reusing the tok2vec layer between
components can make your pipeline run a lot faster and result in much smaller
models. However, it can make the pipeline less modular and make it more
2020-08-22 18:11:28 +03:00
difficult to swap components or retrain parts of the pipeline. Multi-task
learning can affect your accuracy (either positively or negatively), and may
require some retuning of your hyper-parameters.
2020-08-18 15:39:40 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
![Pipeline components using a shared embedding component vs. independent embedding layers](/images/tok2vec.svg)
2020-08-18 15:39:40 +03:00
| Shared | Independent |
| ------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| ✅ **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component |
2020-08-22 16:41:35 +03:00
| ✅ **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component |
2020-08-18 15:39:40 +03:00
| ❌ **less composable:** all components require the same embedding component in the pipeline | ✅ **modular:** components can be moved and swapped freely |
2020-08-22 17:47:03 +03:00
2020-08-22 18:15:05 +03:00
You can share a single transformer or other tok2vec model between multiple
components by adding a [`Transformer`](/api/transformer) or
[`Tok2Vec`](/api/tok2vec) component near the start of your pipeline. Components
later in the pipeline can "connect" to it by including a **listener layer** like
[Tok2VecListener](/api/architectures#Tok2VecListener) within their model.
2020-08-18 15:39:40 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
![Pipeline components listening to shared embedding component](/images/tok2vec-listener.svg)
2020-08-18 15:39:40 +03:00
2020-08-22 17:47:03 +03:00
At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab
a reference to the relevant listener layers in the rest of your pipeline. When
it processes a batch of documents, it will pass forward its predictions to the
listeners, allowing the listeners to **reuse the predictions** when they are
eventually called. A similar mechanism is used to pass gradients from the
listeners back to the model. The [`Transformer`](/api/transformer) component and
2020-08-26 14:31:01 +03:00
[TransformerListener](/api/architectures#TransformerListener) layer do the same
2020-08-22 18:15:05 +03:00
thing for transformer models, but the `Transformer` component will also save the
transformer outputs to the
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
giving you access to them after the pipeline has finished running.
2020-08-22 17:47:03 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Example: Shared vs. independent config {id="embedding-layers-config"}
2020-08-25 12:54:37 +03:00
The [config system](/usage/training#config) lets you express model configuration
for both shared and independent embedding layers. The shared setup uses a single
[`Tok2Vec`](/api/tok2vec) component with the
[Tok2Vec](/api/architectures#Tok2Vec) architecture. All other components, like
the entity recognizer, use a
[Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's
`tok2vec` argument, which connects to the `tok2vec` component model.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="Shared",highlight="1-2,4-5,19-20"}
2020-08-25 12:54:37 +03:00
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
2020-08-25 12:54:37 +03:00
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
2020-08-25 12:54:37 +03:00
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
2020-08-25 12:54:37 +03:00
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
```
In the independent setup, the entity recognizer component defines its own
[Tok2Vec](/api/architectures#Tok2Vec) instance. Other components will do the
same. This makes them fully independent and doesn't require an upstream
[`Tok2Vec`](/api/tok2vec) component to be present in the pipeline.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="Independent", highlight="7-8"}
2020-08-25 12:54:37 +03:00
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
2020-08-25 12:54:37 +03:00
[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
2020-08-25 12:54:37 +03:00
[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
2020-08-25 12:54:37 +03:00
```
2020-08-22 17:47:03 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
{/* TODO: Once rehearsal is tested, mention it here. */}
2020-08-18 15:39:40 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
## Using transformer models {id="transformers"}
2020-08-18 01:49:19 +03:00
Transformers are a family of neural network architectures that compute **dense,
context-sensitive representations** for the tokens in your documents. Downstream
models in your pipeline can then use these representations as input features to
**improve their predictions**. You can connect multiple components to a single
transformer model, with any or all of those components giving feedback to the
transformer to fine-tune it to your tasks. spaCy's transformer support
interoperates with [PyTorch](https://pytorch.org) and the
[HuggingFace `transformers`](https://huggingface.co/transformers/) library,
giving you access to thousands of pretrained models for your pipelines. There
are many [great guides](http://jalammar.github.io/illustrated-transformer/) to
transformer models, but for practical purposes, you can simply think of them as
2020-08-26 12:51:57 +03:00
drop-in replacements that let you achieve **higher accuracy** in exchange for
2020-08-18 01:49:19 +03:00
**higher training and runtime costs**.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Setup and installation {id="transformers-installation"}
2020-08-18 01:49:19 +03:00
> #### System requirements
>
> We recommend an NVIDIA **GPU** with at least **10GB of memory** in order to
> work with transformer models. Make sure your GPU drivers are up to date and
> you have **CUDA v9+** installed.
> The exact requirements will depend on the transformer model. Training a
> transformer-based model without a GPU will be too slow for most practical
> purposes.
>
> Provisioning a new machine will require about **5GB** of data to be
> downloaded: 3GB CUDA runtime, 800MB PyTorch, 400MB CuPy, 500MB weights, 200MB
> spaCy and dependencies.
Once you have CUDA installed, we recommend installing PyTorch following the
[PyTorch installation guidelines](https://pytorch.org/get-started/locally/) for
your package manager and CUDA version. If you skip this step, pip will install
PyTorch as a dependency below, but it may not find the best version for your
setup.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash {title="Example: Install PyTorch 1.11.0 for CUDA 11.3 with pip"}
# See: https://pytorch.org/get-started/locally/
$ pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```
Next, install spaCy with the extras for your CUDA version and transformers. The
CUDA extra (e.g., `cuda102`, `cuda113`) installs the correct version of
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
[`cupy`](https://docs.cupy.dev/en/stable/install.html#installing-cupy), which is
just like `numpy`, but for GPU. You may also need to set the `CUDA_PATH`
environment variable if your CUDA runtime is installed in a non-standard
location. Putting it all together, if you had installed CUDA 11.3 in
`/opt/nvidia/cuda`, you would run:
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash {title="Installation with CUDA"}
2020-08-19 01:28:37 +03:00
$ export CUDA_PATH="/opt/nvidia/cuda"
$ pip install -U %%SPACY_PKG_NAME[cuda113,transformers]%%SPACY_PKG_FLAGS
2020-08-18 01:49:19 +03:00
```
For [`transformers`](https://huggingface.co/transformers/) v4.0.0+ and models
that require [`SentencePiece`](https://github.com/google/sentencepiece) (e.g.,
ALBERT, CamemBERT, XLNet, Marian, and T5), install the additional dependencies
with:
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash {title="Install sentencepiece"}
$ pip install transformers[sentencepiece]
```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Runtime usage {id="transformers-runtime"}
2020-08-18 01:49:19 +03:00
Transformer models can be used as **drop-in replacements** for other types of
neural networks, so your spaCy pipeline can include them in a way that's
completely invisible to the user. Users will download, load and use the model in
the standard way, like any other spaCy pipeline. Instead of using the
transformers as subnetworks directly, you can also use them via the
[`Transformer`](/api/transformer) pipeline component.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
![The processing pipeline with the transformer component](/images/pipeline_transformer.svg)
2020-08-18 01:49:19 +03:00
The `Transformer` component sets the
2020-08-18 01:49:19 +03:00
[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
2020-10-09 13:04:52 +03:00
which lets you access the transformers outputs at runtime. The trained
transformer-based [pipelines](/models) provided by spaCy end on `_trf`, e.g.
[`en_core_web_trf`](/models/en#en_core_web_trf).
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash
2020-10-08 12:56:50 +03:00
$ python -m spacy download en_core_web_trf
2020-08-18 01:49:19 +03:00
```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```python {title="Example"}
2020-08-18 01:49:19 +03:00
import spacy
2020-10-16 12:33:20 +03:00
from thinc.api import set_gpu_allocator, require_gpu
2020-08-18 01:49:19 +03:00
# Use the GPU, with memory allocations directed via PyTorch.
# This prevents out-of-memory errors that would otherwise occur from competing
# memory pools.
2020-10-16 12:33:20 +03:00
set_gpu_allocator("pytorch")
2020-08-18 01:49:19 +03:00
require_gpu(0)
2020-10-08 12:56:50 +03:00
nlp = spacy.load("en_core_web_trf")
2020-08-18 01:49:19 +03:00
for doc in nlp.pipe(["some text", "some other text"]):
tokvecs = doc._.trf_data.tensors[-1]
```
You can also customize how the [`Transformer`](/api/transformer) component sets
annotations onto the [`Doc`](/api/doc) by specifying a custom
`set_extra_annotations` function. This callback will be called with the raw
input and output data for the whole batch, along with the batch of `Doc`
objects, allowing you to implement whatever you need. The annotation setter is
called with a batch of [`Doc`](/api/doc) objects and a
2020-08-31 15:40:55 +03:00
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the
transformers data for the batch.
2020-08-18 01:49:19 +03:00
```python
def custom_annotation_setter(docs, trf_data):
doc_data = list(trf_data.doc_data)
for doc, data in zip(docs, doc_data):
doc._.custom_attr = data
2020-08-18 01:49:19 +03:00
2020-10-08 12:56:50 +03:00
nlp = spacy.load("en_core_web_trf")
nlp.get_pipe("transformer").set_extra_annotations = custom_annotation_setter
2020-08-18 01:49:19 +03:00
doc = nlp("This is a text")
assert isinstance(doc._.custom_attr, TransformerData)
print(doc._.custom_attr.tensors)
2020-08-18 01:49:19 +03:00
```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Training usage {id="transformers-training"}
2020-08-18 01:49:19 +03:00
The recommended workflow for training is to use spaCy's
[config system](/usage/training#config), usually via the
[`spacy train`](/api/cli#train) command. The training config defines all
component settings and hyperparameters in one place and lets you describe a tree
of objects by referring to creation functions, including functions you register
yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
{/* TODO: <Project id="pipelines/transformers"> */}
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
{/* The easiest way to get started is to clone a transformers-based project */}
{/* template. Swap in your data, edit the settings and hyperparameters and train, */}
{/* evaluate, package and visualize your model. */}
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
{/* </Project> */}
2020-08-18 01:49:19 +03:00
The `[components]` section in the [`config.cfg`](/api/data-formats#config)
describes the pipeline components and the settings used to construct them,
including their model implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:
> #### Python equivalent
>
> ```python
> from spacy_transformers import Transformer, TransformerModel
> from spacy_transformers.annotation_setters import null_annotation_setter
2020-08-18 01:49:19 +03:00
> from spacy_transformers.span_getters import get_doc_spans
>
> trf = Transformer(
> nlp.vocab,
> TransformerModel(
> "bert-base-cased",
> get_spans=get_doc_spans,
> tokenizer_config={"use_fast": True},
> ),
> set_extra_annotations=null_annotation_setter,
2020-08-18 01:49:19 +03:00
> max_batch_items=4096,
> )
> ```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="config.cfg",excerpt="true"}
2020-08-18 01:49:19 +03:00
[components.transformer]
factory = "transformer"
max_batch_items = 4096
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
2020-08-18 01:49:19 +03:00
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
2020-09-03 18:37:06 +03:00
@span_getters = "spacy-transformers.doc_spans.v1"
2020-08-18 01:49:19 +03:00
[components.transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
2020-08-18 01:49:19 +03:00
```
The `[components.transformer.model]` block describes the `model` argument passed
to the transformer component. It's a Thinc
[`Model`](https://thinc.ai/docs/api-model) object that will be passed into the
component. Here, it references the function
[spacy-transformers.TransformerModel.v3](/api/architectures#TransformerModel)
2020-08-18 01:49:19 +03:00
registered in the [`architectures` registry](/api/top-level#registry). If a key
in a block starts with `@`, it's **resolved to a function** and all other
settings are passed to the function as arguments. In this case, `name`,
`tokenizer_config` and `get_spans`.
2020-08-27 11:19:43 +03:00
`get_spans` is a function that takes a batch of `Doc` objects and returns lists
2020-08-18 01:49:19 +03:00
of potentially overlapping `Span` objects to process by the transformer. Several
2020-08-27 11:19:43 +03:00
[built-in functions](/api/transformer#span_getters) are available for example,
2020-08-18 01:49:19 +03:00
to process the whole document or individual sentences. When the config is
resolved, the function is created and passed into the model as an argument.
The `name` value is the name of any [HuggingFace model](huggingface-models),
which will be downloaded automatically the first time it's used. You can also
use a local file path. For full details, see the
[`TransformerModel` docs](/api/architectures#TransformerModel).
[huggingface-models]:
https://huggingface.co/models?library=pytorch&sort=downloads
A wide variety of PyTorch models are supported, but some might not work. If a
model doesn't seem to work feel free to open an
[issue](https://github.com/explosion/spacy/issues). Additionally note that
Transformers loaded in spaCy can only be used for tensors, and pretrained
task-specific heads or text generation features cannot be used as part of the
`transformer` pipeline component.
2020-08-18 01:49:19 +03:00
<Infobox variant="warning">
Remember that the `config.cfg` used for training should contain **no missing
values** and requires all settings to be defined. You don't want any hidden
defaults creeping in and changing your results! spaCy will tell you if settings
are missing, and you can run
[`spacy init fill-config`](/api/cli#init-fill-config) to automatically fill in
all defaults.
</Infobox>
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Customizing the settings {id="transformers-training-custom-settings"}
2020-08-18 01:49:19 +03:00
To change any of the settings, you can edit the `config.cfg` and re-run the
training. To change any of the functions, like the span getter, you can replace
2020-09-03 18:37:06 +03:00
the name of the referenced function e.g.
`@span_getters = "spacy-transformers.sent_spans.v1"` to process sentences. You
can also register your own functions using the
2020-08-31 15:40:55 +03:00
[`span_getters` registry](/api/top-level#registry). For instance, the following
custom function returns [`Span`](/api/span) objects following sentence
boundaries, unless a sentence succeeds a certain amount of tokens, in which case
2020-08-29 13:53:14 +03:00
subsentences of at most `max_length` tokens are returned.
2020-08-18 01:49:19 +03:00
> #### config.cfg
>
> ```ini
> [components.transformer.model.get_spans]
> @span_getters = "custom_sent_spans"
2020-08-27 15:14:16 +03:00
> max_length = 25
2020-08-18 01:49:19 +03:00
> ```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```python {title="code.py"}
2020-08-18 01:49:19 +03:00
import spacy_transformers
@spacy_transformers.registry.span_getters("custom_sent_spans")
2020-08-27 15:14:16 +03:00
def configure_custom_sent_spans(max_length: int):
def get_custom_sent_spans(docs):
spans = []
for doc in docs:
spans.append([])
for sent in doc.sents:
start = 0
end = max_length
while end <= len(sent):
spans[-1].append(sent[start:end])
start += max_length
end += max_length
if start < len(sent):
2020-08-27 15:33:28 +03:00
spans[-1].append(sent[start:len(sent)])
2020-08-27 15:14:16 +03:00
return spans
return get_custom_sent_spans
2020-08-18 01:49:19 +03:00
```
To resolve the config during training, spaCy needs to know about your custom
function. You can make it available via the `--code` argument that can point to
a Python file. For more details on training with custom code, see the
2020-08-29 13:53:14 +03:00
[training documentation](/usage/training#custom-functions).
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash
2020-08-19 01:28:37 +03:00
python -m spacy train ./config.cfg --code ./code.py
2020-08-18 01:49:19 +03:00
```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Customizing the model implementations {id="training-custom-model"}
2020-08-18 01:49:19 +03:00
The [`Transformer`](/api/transformer) component expects a Thinc
[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model`
argument. You're not limited to the implementation provided by
`spacy-transformers` the only requirement is that your registered function
must return an object of type ~~Model[List[Doc], FullTransformerBatch]~~: that
is, a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a
[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the
transformer data.
The same idea applies to task models that power the **downstream components**.
Most of spaCy's built-in model creation functions support a `tok2vec` argument,
which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This
is where we'll plug in our transformer model, using the
2020-08-27 20:24:44 +03:00
[TransformerListener](/api/architectures#TransformerListener) layer, which
sneakily delegates to the `Transformer` pipeline component.
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="config.cfg (excerpt)",highlight="12"}
2020-08-18 01:49:19 +03:00
[components.ner]
factory = "ner"
[nlp.pipeline.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
state_type = "ner"
extra_state_tokens = false
2020-08-18 01:49:19 +03:00
hidden_width = 128
maxout_pieces = 3
use_upper = false
[nlp.pipeline.ner.model.tok2vec]
2020-08-27 15:33:28 +03:00
@architectures = "spacy-transformers.TransformerListener.v1"
2020-08-18 01:49:19 +03:00
grad_factor = 1.0
[nlp.pipeline.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```
2020-08-27 20:24:44 +03:00
The [TransformerListener](/api/architectures#TransformerListener) layer expects
a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the
argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This
layer determines how the vector for each spaCy token will be computed from the
zero or more source rows the token is aligned against. Here we use the
2020-08-18 01:49:19 +03:00
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom
function you write yourself.
You can have multiple components all listening to the same transformer model,
and all passing gradients back to it. By default, all of the gradients will be
**equally weighted**. You can control this with the `grad_factor` setting, which
lets you reweight the gradients from the different listeners. For instance,
setting `grad_factor = 0` would disable gradients from one of the listeners,
while `grad_factor = 2.0` would multiply them by 2. This is similar to having a
custom learning rate for each component. Instead of a constant, you can also
provide a schedule, allowing you to freeze the shared parameters at the start of
training.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
## Static vectors {id="static-vectors"}
2020-08-18 01:49:19 +03:00
2020-09-17 18:12:51 +03:00
If your pipeline includes a **word vectors table**, you'll be able to use the
`.similarity()` method on the [`Doc`](/api/doc), [`Span`](/api/span),
[`Token`](/api/token) and [`Lexeme`](/api/lexeme) objects. You'll also be able
to access the vectors using the `.vector` attribute, or you can look up one or
more vectors directly using the [`Vocab`](/api/vocab) object. Pipelines with
word vectors can also **use the vectors as features** for the statistical
models, which can **improve the accuracy** of your components.
2020-09-17 17:42:53 +03:00
Word vectors in spaCy are "static" in the sense that they are not learned
parameters of the statistical models, and spaCy itself does not feature any
algorithms for learning word vector tables. You can train a word vectors table
using tools such as [floret](https://github.com/explosion/floret),
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
[Gensim](https://radimrehurek.com/gensim/), [FastText](https://fasttext.cc/) or
2020-09-17 18:12:51 +03:00
[GloVe](https://nlp.stanford.edu/projects/glove/), or download existing
2020-10-01 13:15:53 +03:00
pretrained vectors. The [`init vectors`](/api/cli#init-vectors) command lets you
2020-09-17 18:12:51 +03:00
convert vectors for use with spaCy and will give you a directory you can load or
refer to in your [training configs](/usage/training#config).
<Infobox title="Word vectors and similarity" emoji="📖">
For more details on loading word vectors into spaCy, using them for similarity
and improving word vector coverage by truncating and pruning the vectors, see
the usage guide on
[word vectors and similarity](/usage/linguistic-features#vectors-similarity).
</Infobox>
2020-08-18 01:49:19 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### Using word vectors in your models {id="word-vectors-models"}
2020-08-18 01:49:19 +03:00
Many neural network models are able to use word vector tables as additional
features, which sometimes results in significant improvements in accuracy.
spaCy's built-in embedding layer,
[MultiHashEmbed](/api/architectures#MultiHashEmbed), can be configured to use
2020-10-16 00:39:04 +03:00
word vector tables using the `include_static_vectors` flag.
2020-08-18 01:49:19 +03:00
```ini
[tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
2020-08-18 01:49:19 +03:00
width = 128
2020-10-09 15:42:07 +03:00
attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true
2020-08-18 01:49:19 +03:00
```
<Infobox title="How it works" emoji="💡">
The configuration system will look up the string `"spacy.MultiHashEmbed.v2"` in
2020-08-18 01:49:19 +03:00
the `architectures` [registry](/api/top-level#registry), and call the returned
object with the rest of the arguments from the block. This will result in a call
to the
[`MultiHashEmbed`](https://github.com/explosion/spacy/tree/develop/spacy/ml/models/tok2vec.py)
function, which will return a [Thinc](https://thinc.ai) model object with the
type signature ~~Model[List[Doc], List[Floats2d]]~~. Because the embedding layer
takes a list of `Doc` objects as input, it does not need to store a copy of the
vectors table. The vectors will be retrieved from the `Doc` objects that are
passed in, via the `doc.vocab.vectors` attribute. This part of the process is
handled by the [StaticVectors](/api/architectures#StaticVectors) layer.
</Infobox>
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
#### Creating a custom embedding layer {id="custom-embedding-layer"}
2020-08-18 01:49:19 +03:00
The [MultiHashEmbed](/api/architectures#StaticVectors) layer is spaCy's
recommended strategy for constructing initial word representations for your
neural network models, but you can also implement your own. You can register any
function to a string name, and then reference that function within your config
(see the [training docs](/usage/training) for more details). To try this out,
you can save the following little example to a new Python file:
```python
from spacy.ml.staticvectors import StaticVectors
from spacy.util import registry
print("I was imported!")
@registry.architectures("my_example.MyEmbedding.v1")
def MyEmbedding(output_width: int) -> Model[List[Doc], List[Floats2d]]:
print("I was called!")
return StaticVectors(nO=output_width)
```
If you pass the path to your file to the [`spacy train`](/api/cli#train) command
using the `--code` argument, your file will be imported, which means the
decorator registering the function will be run. Your function is now on equal
footing with any of spaCy's built-ins, so you can drop it in instead of any
other model with the same input and output signature. For instance, you could
use it in the tagger model as follows:
```ini
[tagger.model.tok2vec.embed]
@architectures = "my_example.MyEmbedding.v1"
output_width = 128
```
Now that you have a custom function wired into the network, you can start
implementing the logic you're interested in. For example, let's say you want to
try a relatively simple embedding strategy that makes use of static word
vectors, but combines them via summation with a smaller table of learned
embeddings.
```python
from thinc.api import add, chain, remap_ids, Embed
2020-08-18 01:49:19 +03:00
from spacy.ml.staticvectors import StaticVectors
from spacy.ml.featureextractor import FeatureExtractor
2020-08-27 15:50:43 +03:00
from spacy.util import registry
2020-08-18 01:49:19 +03:00
@registry.architectures("my_example.MyEmbedding.v1")
def MyCustomVectors(
output_width: int,
vector_width: int,
embed_rows: int,
key2row: Dict[int, int]
) -> Model[List[Doc], List[Floats2d]]:
return add(
StaticVectors(nO=output_width),
chain(
FeatureExtractor(["ORTH"]),
remap_ids(key2row),
Embed(nO=output_width, nV=embed_rows)
)
)
```
#### Creating a custom vectors implementation {id="custom-vectors",version="3.7"}
You can specify a custom registered vectors class under `[nlp.vectors]` in order
to use static vectors in formats other than the ones supported by
[`Vectors`](/api/vectors). Extend the abstract [`BaseVectors`](/api/basevectors)
class to implement your custom vectors.
As an example, the following `BPEmbVectors` class implements support for
[BPEmb subword embeddings](https://bpemb.h-its.org/):
```python
# requires: pip install bpemb
import warnings
from pathlib import Path
from typing import Callable, Optional, cast
from bpemb import BPEmb
from thinc.api import Ops, get_current_ops
from thinc.backends import get_array_ops
from thinc.types import Floats2d
from spacy.strings import StringStore
from spacy.util import registry
from spacy.vectors import BaseVectors
from spacy.vocab import Vocab
class BPEmbVectors(BaseVectors):
def __init__(
self,
*,
strings: Optional[StringStore] = None,
lang: Optional[str] = None,
vs: Optional[int] = None,
dim: Optional[int] = None,
cache_dir: Optional[Path] = None,
encode_extra_options: Optional[str] = None,
model_file: Optional[Path] = None,
emb_file: Optional[Path] = None,
):
kwargs = {}
if lang is not None:
kwargs["lang"] = lang
if vs is not None:
kwargs["vs"] = vs
if dim is not None:
kwargs["dim"] = dim
if cache_dir is not None:
kwargs["cache_dir"] = cache_dir
if encode_extra_options is not None:
kwargs["encode_extra_options"] = encode_extra_options
if model_file is not None:
kwargs["model_file"] = model_file
if emb_file is not None:
kwargs["emb_file"] = emb_file
self.bpemb = BPEmb(**kwargs)
self.strings = strings
self.name = repr(self.bpemb)
self.n_keys = -1
self.mode = "BPEmb"
self.to_ops(get_current_ops())
def __contains__(self, key):
return True
def is_full(self):
return True
def add(self, key, *, vector=None, row=None):
warnings.warn(
(
"Skipping BPEmbVectors.add: the bpemb vector table cannot be "
"modified. Vectors are calculated from bytepieces."
)
)
return -1
def __getitem__(self, key):
return self.get_batch([key])[0]
def get_batch(self, keys):
keys = [self.strings.as_string(key) for key in keys]
bp_ids = self.bpemb.encode_ids(keys)
ops = get_array_ops(self.bpemb.emb.vectors)
indices = ops.asarray(ops.xp.hstack(bp_ids), dtype="int32")
lengths = ops.asarray([len(x) for x in bp_ids], dtype="int32")
vecs = ops.reduce_mean(cast(Floats2d, self.bpemb.emb.vectors[indices]), lengths)
return vecs
@property
def shape(self):
return self.bpemb.vectors.shape
def __len__(self):
return self.shape[0]
@property
def vectors_length(self):
return self.shape[1]
@property
def size(self):
return self.bpemb.vectors.size
def to_ops(self, ops: Ops):
self.bpemb.emb.vectors = ops.asarray(self.bpemb.emb.vectors)
@registry.vectors("BPEmbVectors.v1")
def create_bpemb_vectors(
lang: Optional[str] = "multi",
vs: Optional[int] = None,
dim: Optional[int] = None,
cache_dir: Optional[Path] = None,
encode_extra_options: Optional[str] = None,
model_file: Optional[Path] = None,
emb_file: Optional[Path] = None,
) -> Callable[[Vocab], BPEmbVectors]:
def bpemb_vectors_factory(vocab: Vocab) -> BPEmbVectors:
return BPEmbVectors(
strings=vocab.strings,
lang=lang,
vs=vs,
dim=dim,
cache_dir=cache_dir,
encode_extra_options=encode_extra_options,
model_file=model_file,
emb_file=emb_file,
)
return bpemb_vectors_factory
```
<Infobox variant="warning">
Note that the serialization methods are not implemented, so the embeddings are
loaded from your local cache or downloaded by `BPEmb` each time the pipeline is
loaded.
</Infobox>
To use this in your pipeline, specify this registered function under
`[nlp.vectors]` in your config:
```ini
[nlp.vectors]
@vectors = "BPEmbVectors.v1"
lang = "en"
```
Or specify it when creating a blank pipeline:
```python
nlp = spacy.blank("en", config={"nlp.vectors": {"@vectors": "BPEmbVectors.v1", "lang": "en"}})
```
Remember to include this code with `--code` when using
[`spacy train`](/api/cli#train) and [`spacy package`](/api/cli#package).
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
## Pretraining {id="pretraining"}
2020-08-18 01:49:19 +03:00
2020-09-17 20:24:48 +03:00
The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize your
models with **information from raw text**. Without pretraining, the models for
your components will usually be initialized randomly. The idea behind
pretraining is simple: random probably isn't optimal, so if we have some text to
learn from, we can probably find a way to get the model off to a better start.
Pretraining uses the same [`config.cfg`](/usage/training#config) file as the
regular training, which helps keep the settings and hyperparameters consistent.
The additional `[pretraining]` section has several configuration subsections
that are familiar from the training block: the `[pretraining.batcher]`,
`[pretraining.optimizer]` and `[pretraining.corpus]` all work the same way and
2020-09-17 19:17:03 +03:00
expect the same types of objects, although for pretraining your corpus does not
2020-09-17 20:24:48 +03:00
need to have any annotations, so you will often use a different reader, such as
2020-10-02 02:36:06 +03:00
the [`JsonlCorpus`](/api/top-level#jsonlcorpus).
2020-08-25 12:54:37 +03:00
> #### Raw text format
>
2020-09-17 20:24:48 +03:00
> The raw text can be provided in spaCy's
> [binary `.spacy` format](/api/data-formats#training) consisting of serialized
> `Doc` objects or as a JSONL (newline-delimited JSON) with a key `"text"` per
> entry. This allows the data to be read in line by line, while also allowing
> you to include newlines in the texts.
2020-08-25 12:54:37 +03:00
>
> ```json
> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
> ```
2020-09-17 20:24:48 +03:00
>
> You can also use your own custom corpus loader instead.
You can add a `[pretraining]` block to your config by setting the
`--pretraining` flag on [`init config`](/api/cli#init-config) or
[`init fill-config`](/api/cli#init-fill-config):
2020-08-25 12:54:37 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash
2020-08-25 12:54:37 +03:00
$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining
```
2020-09-17 20:24:48 +03:00
You can then run [`spacy pretrain`](/api/cli#pretrain) with the updated config
and pass in optional config overrides, like the path to the raw text file:
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```bash
2021-09-16 09:54:12 +03:00
$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw_text text.jsonl
2020-09-17 20:24:48 +03:00
```
2020-09-20 13:30:53 +03:00
The following defaults are used for the `[pretraining]` block and merged into
your existing config when you run [`init config`](/api/cli#init-config) or
[`init fill-config`](/api/cli#init-fill-config) with `--pretraining`. If needed,
you can [configure](#pretraining-configure) the settings and hyperparameters or
change the [objective](#pretraining-objectives).
2020-09-20 13:30:53 +03:00
```ini
%%GITHUB_SPACY/spacy/default_config_pretraining.cfg
```
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
### How pretraining works {id="pretraining-details"}
2020-09-17 20:24:48 +03:00
The impact of [`spacy pretrain`](/api/cli#pretrain) varies, but it will usually
be worth trying if you're **not using a transformer** model and you have
**relatively little training data** (for instance, fewer than 5,000 sentences).
A good rule of thumb is that pretraining will generally give you a similar
accuracy improvement to using word vectors in your model. If word vectors have
given you a 10% error reduction, pretraining with spaCy might give you another
10%, for a 20% error reduction in total.
The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific
subnetwork** within one of your components, and add additional layers to build a
network for a temporary task that forces the model to learn something about
sentence structure and word cooccurrence statistics.
Pretraining produces a **binary weights file** that can be loaded back in at the
start of training, using the configuration option `initialize.init_tok2vec`. The
weights file specifies an initial set of weights. Training then proceeds as
2020-09-17 20:24:48 +03:00
normal.
You can only pretrain one subnetwork from your pipeline at a time, and the
subnetwork must be typed ~~Model[List[Doc], List[Floats2d]]~~ (i.e. it has to be
a "tok2vec" layer). The most common workflow is to use the
[`Tok2Vec`](/api/tok2vec) component to create a shared token-to-vector layer for
several components of your pipeline, and apply pretraining to its whole model.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
#### Configuring the pretraining {id="pretraining-configure"}
2020-09-17 20:24:48 +03:00
The [`spacy pretrain`](/api/cli#pretrain) command is configured using the
`[pretraining]` section of your [config file](/usage/training#config). The
`component` and `layer` settings tell spaCy how to **find the subnetwork** to
pretrain. The `layer` setting should be either the empty string (to use the
whole model), or a
[node reference](https://thinc.ai/docs/usage-models#model-state). Most of
spaCy's built-in model architectures have a reference named `"tok2vec"` that
will refer to the right layer.
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="config.cfg"}
2020-09-17 20:24:48 +03:00
# 1. Use the whole model of the "tok2vec" component
[pretraining]
component = "tok2vec"
layer = ""
# 2. Pretrain the "tok2vec" node of the "textcat" component
[pretraining]
component = "textcat"
layer = "tok2vec"
2020-08-25 12:54:37 +03:00
```
2020-09-17 20:24:48 +03:00
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
#### Connecting pretraining to training {id="pretraining-training"}
To benefit from pretraining, your training step needs to know to initialize its
`tok2vec` component with the weights learned from the pretraining step. You do
this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
that you want to use from pretraining.
A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
an example, produces `pretrain/model0.bin` through `pretrain/model4.bin` plus a
copy of the last iteration as `pretrain/model-last.bin`. Additionally, you can
configure `n_save_epoch` to tell pretraining in which epoch interval it should
save the current training progress. To use the final output to initialize your
`tok2vec` layer, you could fill in this value in your config file:
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
```ini {title="config.cfg"}
[paths]
init_tok2vec = "pretrain/model-last.bin"
[initialize]
init_tok2vec = ${paths.init_tok2vec}
```
<Infobox variant="warning">
The outputs of `spacy pretrain` are not the same data format as the pre-packaged
static word vectors that would go into
[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
output consists of the weights that the `tok2vec` component should start with in
an existing pipeline, so it goes in `initialize.init_tok2vec`.
</Infobox>
Website migration from Gatsby to Next (#12058) * Rename all MDX file to `.mdx` * Lock current node version (#11885) * Apply Prettier (#11996) * Minor website fixes (#11974) [ci skip] * fix table * Migrate to Next WEB-17 (#12005) * Initial commit * Run `npx create-next-app@13 next-blog` * Install MDX packages Following: https://github.com/vercel/next.js/blob/77b5f79a4dff453abb62346bf75b14d859539b81/packages/next-mdx/readme.md * Add MDX to Next * Allow Next to handle `.md` and `.mdx` files. * Add VSCode extension recommendation * Disabled TypeScript strict mode for now * Add prettier * Apply Prettier to all files * Make sure to use correct Node version * Add basic implementation for `MDXRemote` * Add experimental Rust MDX parser * Add `/public` * Add SASS support * Remove default pages and styling * Convert to module This allows to use `import/export` syntax * Add import for custom components * Add ability to load plugins * Extract function This will make the next commit easier to read * Allow to handle directories for page creation * Refactoring * Allow to parse subfolders for pages * Extract logic * Redirect `index.mdx` to parent directory * Disabled ESLint during builds * Disabled typescript during build * Remove Gatsby from `README.md` * Rephrase Docker part of `README.md` * Update project structure in `README.md` * Move and rename plugins * Update plugin for wrapping sections * Add dependencies for plugin * Use plugin * Rename wrapper type * Simplify unnessary adding of id to sections The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading. * Add plugin for custom attributes on Markdown elements * Add plugin to readd support for tables * Add plugin to fix problem with wrapped images For more details see this issue: https://github.com/mdx-js/mdx/issues/1798 * Add necessary meta data to pages * Install necessary dependencies * Remove outdated MDX handling * Remove reliance on `InlineList` * Use existing Remark components * Remove unallowed heading Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either. * Add missing components to MDX * Add correct styling * Fix broken list * Fix broken CSS classes * Implement layout * Fix links * Fix broken images * Fix pattern image * Fix heading attributes * Rename heading attribute `new` was causing some weird issue, so renaming it to `version` * Update comment syntax in MDX * Merge imports * Fix markdown rendering inside components * Add model pages * Simplify anchors * Fix default value for theme * Add Universe index page * Add Universe categories * Add Universe projects * Fix Next problem with copy Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect` * Fix improper component nesting Next doesn't allow block elements inside a `<p>` * Replace landing page MDX with page component * Remove inlined iframe content * Remove ability to inline HTML content in iFrames * Remove MDX imports * Fix problem with image inside link in MDX * Escape character for MDX * Fix unescaped characters in MDX * Fix headings with logo * Allow to export static HTML pages * Add prebuild script This command is automatically run by Next * Replace `svg-loader` with `react-inlinesvg` `svg-loader` is no longer maintained * Fix ESLint `react-hooks/exhaustive-deps` * Fix dropdowns * Change code language from `cli` to `bash` * Remove unnessary language `none` * Fix invalid code language `markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error. * Enable code blocks plugin * Readd `InlineCode` component MDX2 removed the `inlineCode` component > The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions Source: https://mdxjs.com/migrating/v2/#update-mdx-content * Remove unused code * Extract function to own file * Fix code syntax highlighting * Update syntax for code block meta data * Remove unused prop * Fix internal link recognition There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error. `Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"` This simplifies the implementation and fixes the above error. * Replace `react-helmet` with `next/head` * Fix `className` problem for JSX component * Fix broken bold markdown * Convert file to `.mjs` to be used by Node process * Add plugin to replace strings * Fix custom table row styling * Fix problem with `span` inside inline `code` React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode. * Add `_document` to be able to customize `<html>` and `<body>` * Add `lang="en"` * Store Netlify settings in file This way we don't need to update via Netlify UI, which can be tricky if changing build settings. * Add sitemap * Add Smartypants * Add PWA support * Add `manifest.webmanifest` * Fix bug with anchor links after reloading There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar. * Rename custom event I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠 * Fix missing comment syntax highlighting * Refactor Quickstart component The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle. The new implementation simplfy filters the list of children (React elements) via their props. * Fix syntax highlighting for Training Quickstart * Unify code rendering * Improve error logging in Juniper * Fix Juniper component * Automatically generate "Read Next" link * Add Plausible * Use recent DocSearch component and adjust styling * Fix images * Turn of image optimization > Image Optimization using Next.js' default loader is not compatible with `next export`. We currently deploy to Netlify via `next export` * Dont build pages starting with `_` * Remove unused files * Add Next plugin to Netlify * Fix button layout MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string. * Add 404 page * Apply Prettier * Update Prettier for `package.json` Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces. * Apply Next patch to `package-lock.json` When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes. * fix link Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * small backslash fixes * adjust to new style Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
2023-01-11 19:30:07 +03:00
#### Pretraining objectives {id="pretraining-objectives"}
2020-09-17 20:24:48 +03:00
> ```ini
> ### Characters objective
> [pretraining.objective]
> @architectures = "spacy.PretrainCharacters.v1"
> maxout_pieces = 3
> hidden_size = 300
2020-09-17 20:24:48 +03:00
> n_characters = 4
> ```
>
> ```ini
> ### Vectors objective
> [pretraining.objective]
> @architectures = "spacy.PretrainVectors.v1"
> maxout_pieces = 3
> hidden_size = 300
2020-09-17 20:24:48 +03:00
> loss = "cosine"
> ```
Two pretraining objectives are available, both of which are variants of the
cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced
for BERT. The objective can be defined and configured via the
`[pretraining.objective]` config block.
- [`PretrainCharacters`](/api/architectures#pretrain_chars): The `"characters"`
objective asks the model to predict some number of leading and trailing UTF-8
bytes for the words. For instance, setting `n_characters = 2`, the model will
try to predict the first two and last two characters of the word.
2020-09-17 20:24:48 +03:00
- [`PretrainVectors`](/api/architectures#pretrain_vectors): The `"vectors"`
objective asks the model to predict the word's vector, from a static
embeddings table. This requires a word vectors model to be trained and loaded.
The vectors objective can optimize either a cosine or an L2 loss. We've
generally found cosine loss to perform better.
2020-09-17 20:24:48 +03:00
These pretraining objectives use a trick that we term **language modelling with
approximate outputs (LMAO)**. The motivation for the trick is that predicting an
exact word ID introduces a lot of incidental complexity. You need a large output
layer, and even then, the vocabulary is too large, which motivates tokenization
schemes that do not align to actual word boundaries. At the end of training, the
output layer will be thrown away regardless: we just want a task that forces the
network to model something about word cooccurrence statistics. Predicting
leading and trailing characters does that more than adequately, as the exact
word sequence could be recovered with high accuracy if the initial and trailing
characters are predicted accurately. With the vectors objective, the pretraining
uses the embedding space learned by an algorithm such as
2020-09-17 20:24:48 +03:00
[GloVe](https://nlp.stanford.edu/projects/glove/) or
[Word2vec](https://code.google.com/archive/p/word2vec/), allowing the model to
focus on the contextual modelling we actual care about.