---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
- ['New Features', 'features']
- ['Upgrading Notes', 'upgrading']
---
## New Features {id="features",hidden="true"}
spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
vectors, makes custom `Doc` creation and scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there's a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.
<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">
spaCy is now up to **8 &times; faster on M1 Macs** by calling into Apple's
native Accelerate library for matrix multiplication. For more details, see
[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).
```bash
$ pip install spacy[apple]
```
</Infobox>
### Registered scoring functions {id="registered-scoring-functions"}
To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry):
```ini {title="config.cfg (excerpt)",highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```
### Overwrite settings {id="overwrite"}
Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten:
```ini {title="config.cfg (excerpt)",highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```
### Doc input for pipelines {id="doc-input"}
[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
of a string. This makes it easier to create a `Doc` with custom tokenization or
to set custom extensions before processing:
```python
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```
### Support for floret vectors {id="vectors"}
We recently published [`floret`](https://github.com/explosion/floret), an
extended version of [fastText](https://fasttext.cc) that combines fastText's
subwords with Bloom embeddings for compact, full-coverage vectors. Because
vectors are built from subwords, there are no out-of-vocabulary (OOV) words,
and thanks to Bloom embeddings the vector table can be kept very small at
\<100K entries. Bloom embeddings are
already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.
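
The idea of combining subwords with Bloom embeddings can be sketched in a few lines of plain Python. This is a simplified illustration and not floret's actual code: the hash function (seeded CRC32), the use of two hashes per key, the toy table size and all function names are assumptions made for the example.

```python
import random
import zlib


def char_ngrams(word, minn=4, maxn=5):
    """Return the character n-grams of a word padded with < and >."""
    padded = f"<{word}>"
    return [
        padded[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(padded) - n + 1)
    ]


def bloom_rows(key, n_rows, n_hashes=2):
    """Map a string to n_hashes row indices using differently seeded hashes."""
    data = key.encode("utf8")
    return [zlib.crc32(data, seed) % n_rows for seed in range(n_hashes)]


def word_vector(word, table, minn=4, maxn=5, n_hashes=2):
    """Average the hashed rows of the padded word and all of its subwords."""
    keys = [f"<{word}>"] + char_ngrams(word, minn, maxn)
    dim = len(table[0])
    total = [0.0] * dim
    for key in keys:
        for row in bloom_rows(key, len(table), n_hashes):
            for d in range(dim):
                total[d] += table[row][d]
    return [value / len(keys) for value in total]


# Toy table: 50 rows with 4 dimensions each (hypothetical sizes).
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]

print(char_ngrams("cat"))  # ['<cat', 'cat>', '<cat>']
# Any string gets a vector, because every key hashes to some rows in the table.
print(len(word_vector("talossani", table)))  # 4
```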

For easy integration, floret includes a
[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):
```bash
$ pip install floret
```
A demo project shows how to train and import floret vectors:
<Project id="pipelines/floret_vectors_demo">
Train toy English floret vectors and import them into a spaCy pipeline.
</Project>
Two additional demo projects compare standard fastText vectors with floret
vectors for full spaCy pipelines. For agglutinative languages like Finnish or
Korean, there are large improvements in performance due to the use of subwords
(no OOV words!), with a vector table containing merely 50K entries.
<Project id="pipelines/floret_fi_core_demo">

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors.

For the default project settings with 1M (2.6G) tokenized training texts and 50K
300-dim vectors, ~300K keys for the standard vectors:

| Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none | 93.3 | 92.3 | 79.7 | 72.8 | 61.0 |
| standard (pruned: 50K vectors for 300K keys) | 95.9 | 94.7 | 83.3 | 77.9 | 68.5 |
| standard (unpruned: 300K vectors/keys) | 96.0 | 95.0 | **83.8** | 78.4 | 69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** | 83.5 | **78.5** | **70.9** |

</Project>
<Project id="pipelines/floret_ko_ud_demo">

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors.

For the default project settings with 1M (3.3G) tokenized training texts and 50K
300-dim vectors, ~800K keys for the standard vectors:

| Vectors | TAG | POS | DEP UAS | DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none | 72.5 | 85.0 | 73.2 | 64.3 |
| standard (pruned: 50K vectors for 800K keys) | 77.9 | 89.4 | 78.8 | 72.8 |
| standard (unpruned: 800K vectors/keys) | 79.0 | 90.2 | 79.2 | 73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

</Project>
### Updates for spacy-transformers v1.1 {id="spacy-transformers"}
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has
been refactored to improve serialization and to better support inline
transformer components and replacing listeners. In addition, the transformer
model output is
provided as
[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
instead of tuples in
`TransformerData.model_output` and `FullTransformerBatch.model_output`. For
backwards compatibility, the tuple format remains available under
`TransformerData.tensors` and `FullTransformerBatch.tensors`. See more details
in the [transformer API docs](/api/architectures#TransformerModel).

`spacy-transformers` v1.1 also adds support for `transformer_config` settings
such as `output_attentions`. Additional output is stored under
`TransformerData.model_output`. More details are in the
[TransformerModel docs](/api/architectures#TransformerModel). The training speed
has been improved by streamlining allocations for tokenizer output and there is
new support for [mixed-precision training](/api/architectures#TransformerModel).
### New transformer package for Japanese {id="pipeline-packages"}
spaCy v3.2 adds a new transformer pipeline package for Japanese
[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
pretokenizer instead of `mecab` to limit the number of dependencies required for
the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for
their contributions!
### Pipeline and language updates {id="pipeline-updates"}
- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
improved IO and more options for
[model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the
performance for many components, especially fine-grained tagging and sentence
segmentation.
- The English attribute ruler patterns have been overhauled to improve
`Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
Portuguese, improved `noun_chunks` for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
## Notes about upgrading from v3.1 {id="upgrading"}
### Pipeline package version compatibility {id="version-compat"}
> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will
see a warning telling you that the pipeline may be incompatible. The pipeline is
not necessarily incompatible, but we recommend running it against your test
suite or evaluation data to make sure there are no unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):
```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```
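To see why the old range triggers the warning, here is a minimal sketch of a version check against a comma-separated range like the one in `meta.json`. The helpers `parse_version` and `satisfies` are hypothetical and only handle plain numeric versions. A real check would typically rely on a dedicated library such as `packaging`.

```python
def parse_version(text):
    """Turn '3.2.0' into (3, 2, 0); only handles plain numeric releases."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '>=3.1.0,<3.2.0'."""
    checks = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for op in (">=", "<=", "==", ">", "<"):  # two-character operators first
            if clause.startswith(op):
                if not checks[op](current, parse_version(clause[len(op):])):
                    return False
                break
        else:
            raise ValueError(f"Unsupported clause: {clause}")
    return True


print(satisfies("3.2.0", ">=3.1.0,<3.2.0"))  # False: outside the old range
print(satisfies("3.2.0", ">=3.2.0,<3.3.0"))  # True: inside the updated range
```

The installed version 3.2.0 falls outside the range recorded by a v3.1 pipeline, which is why the range has to be updated once you have verified the pipeline.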
### Updating v3.1 configs
To update a config from spaCy v3.1 with the new v3.2 settings, run
[`init fill-config`](/api/cli#init-fill-config):
```bash
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```
In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
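
Conceptually, filling a config means adding every setting that is missing while leaving existing values untouched. The sketch below shows that idea on nested dictionaries. It is not how `init fill-config` is implemented, and the function name `fill_defaults` and the example values are assumptions for illustration.

```python
def fill_defaults(config, defaults):
    """Recursively add keys from defaults that are missing in config."""
    filled = dict(config)
    for key, default in defaults.items():
        if key not in filled:
            filled[key] = default
        elif isinstance(filled[key], dict) and isinstance(default, dict):
            filled[key] = fill_defaults(filled[key], default)
    return filled


# A v3.1-style section without the new setting (hypothetical excerpt).
old = {"components": {"tagger": {"factory": "tagger"}}}
# Defaults that include the new `overwrite` setting.
new_defaults = {"components": {"tagger": {"factory": "tagger", "overwrite": False}}}

print(fill_defaults(old, new_defaults))
# {'components': {'tagger': {'factory': 'tagger', 'overwrite': False}}}
```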
## Notes about upgrading from spacy-transformers v1.0 {id="upgrading-transformers"}
When you're loading a transformer pipeline package trained with
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
to import v1.0 `transformer` components into the new internal format with no
change in performance, but here we'd also recommend running your test suite to
verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
be saved in the new v1.1 format and should be fully compatible with
`spacy-transformers` v1.1. Once you've confirmed the performance, you can update
the requirements in [`meta.json`](/api/data-formats#meta):
```diff
"requirements": [
- "spacy-transformers>=1.0.3,<1.1.0"
+ "spacy-transformers>=1.1.2,<1.2.0"
]
```
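If you prefer to script this change, the following sketch replaces the pin for one package in a loaded `meta.json` dictionary. The helper `update_requirement` and the pipeline name `my_pipeline` are hypothetical. In practice you would load the dictionary from your pipeline's `meta.json` and write it back after updating.

```python
import json
import re


def update_requirement(meta, package, new_spec):
    """Return a copy of `meta` with the version pin for `package` replaced."""
    updated = dict(meta)
    requirements = []
    for requirement in meta.get("requirements", []):
        # The package name is everything before the first version operator.
        name = re.split(r"[<>=!~ ]", requirement, maxsplit=1)[0]
        requirements.append(f"{package}{new_spec}" if name == package else requirement)
    updated["requirements"] = requirements
    return updated


# Hypothetical meta.json content, reduced to the relevant fields.
meta = {"name": "my_pipeline", "requirements": ["spacy-transformers>=1.0.3,<1.1.0"]}
updated = update_requirement(meta, "spacy-transformers", ">=1.1.2,<1.2.0")
print(json.dumps(updated, indent=2))
```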
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).