mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 17:06:29 +03:00
Docs for v3.4 (#11057)
* Add draft of v3.4 usage
* Add Croatian models
* Add Matcher min/max
* Update release notes
* Minor edits
* Add updates, tables
* Update pydantic/mypy versions
* Update version in README
* Fix sidebar
parent
d583626a82
commit
11f859c132
|
@ -16,7 +16,7 @@ production-ready [**training system**](https://spacy.io/usage/training) and easy
|
|||
model packaging, deployment and workflow management. spaCy is commercial
|
||||
open-source software, released under the MIT license.
|
||||
|
||||
- 💫 **Version 3.3.1 out now!**
+ 💫 **Version 3.4.0 out now!**
|
||||
[Check out the release notes here.](https://github.com/explosion/spaCy/releases)
|
||||
|
||||
[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
|
||||
|
|
143
website/docs/usage/v3-4.md
Normal file
|
@ -0,0 +1,143 @@
|
|||
---
title: What's New in v3.4
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---
|
||||
|
||||
## New features {#features hidden="true"}
|
||||
|
||||
spaCy v3.4 brings typing and speed improvements along with new vectors for
|
||||
English CNN pipelines and new trained pipelines for Croatian. This release also
|
||||
includes prebuilt Linux aarch64 wheels for all spaCy dependencies distributed by
|
||||
Explosion.
|
||||
|
||||
### Typing improvements {#typing}
|
||||
|
||||
spaCy v3.4 supports pydantic v1.9 and mypy 0.950+ through extensive updates to
|
||||
types in Thinc v8.1.
|
||||
|
||||
### Speed improvements {#speed}
|
||||
|
||||
- The parser now uses the C `saxpy`/`sgemm` routines provided by the `Ops`
  implementation, which lets it use Accelerate through `thinc-apple-ops`.
|
||||
- Improved speed of vector lookups.
|
||||
- Improved speed for `Example.get_aligned_parse` and `Example.get_aligned`.
|
||||
|
||||
## Additional features and improvements
|
||||
|
||||
- Min/max `{n,m}` operator for `Matcher` patterns.
|
||||
- Language updates:
|
||||
  - Improve tokenization for Cyrillic combining diacritics.
  - Improve English tokenizer exceptions for contractions with
    this/that/these/those.
|
||||
- Updated `spacy project clone` to try both `main` and `master` branches by
|
||||
default.
|
||||
- Added confidence threshold for named entity linker.
|
||||
- Improved handling of Typer optional default values for `init_config_cli`.
|
||||
- Added cycle detection in parser projectivization methods.
|
||||
- Added counts for NER labels in `debug data`.
|
||||
- Support for adding NVTX ranges to `TrainablePipe` components.
|
||||
- Support env variable `SPACY_NUM_BUILD_JOBS` to specify the number of build
|
||||
jobs to run in parallel with `pip`.
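
The min/max `{n,m}` operator for `Matcher` patterns listed above can be sketched as follows. This is a minimal example, assuming spaCy v3.4 or later is installed; the pattern name `VERY_GOOD` and the example sentence are made up for illustration, and a blank English pipeline is used so that no trained package needs to be downloaded.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: the pattern only needs the tokenizer.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "{2,3}" asks for the token "very" at least 2 and at most 3 times,
# followed by the token "good". The name "VERY_GOOD" is illustrative.
pattern = [{"LOWER": "very", "OP": "{2,3}"}, {"LOWER": "good"}]

# greedy="LONGEST" keeps only the longest of any overlapping matches.
matcher.add("VERY_GOOD", [pattern], greedy="LONGEST")

doc = nlp("This is very very very good")
matches = [doc[start:end].text for _, start, end in matcher(doc)]

# A single "very" is below the minimum of 2, so this should not match.
too_few = matcher(nlp("This is very good"))
```

Without `greedy`, the matcher would typically also report the shorter overlapping span with two repetitions, so the filter is only a convenience for this example.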
|
||||
|
||||
## Trained pipelines {#pipelines}
|
||||
|
||||
### New trained pipelines {#new-pipelines}
|
||||
|
||||
v3.4 introduces new CPU/CNN pipelines for Croatian, which use the trainable
|
||||
lemmatizer and [floret vectors](https://github.com/explosion/floret). Due to the
|
||||
use of [Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and
|
||||
subwords, the pipelines have compact vectors with no out-of-vocabulary words.
|
||||
|
||||
| Package                                         | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | ---: | ---------: | ----: |
| [`hr_core_news_sm`](/models/hr#hr_core_news_sm) | 96.6 |       77.5 |  76.1 |
| [`hr_core_news_md`](/models/hr#hr_core_news_md) | 97.3 |       80.1 |  81.8 |
| [`hr_core_news_lg`](/models/hr#hr_core_news_lg) | 97.5 |       80.4 |  83.0 |
|
||||
|
||||
### Pipeline updates {#pipeline-updates}
|
||||
|
||||
All CNN pipelines have been extended with whitespace augmentation.
|
||||
|
||||
The English CNN pipelines have new word vectors:
|
||||
|
||||
| Package                                       | Model Version |  TAG | Parser LAS | NER F |
| --------------------------------------------- | ------------- | ---: | ---------: | ----: |
| [`en_core_web_md`](/models/en#en_core_web_md) | v3.3.0        | 97.3 |       90.1 |  84.6 |
| [`en_core_web_md`](/models/en#en_core_web_md) | v3.4.0        | 97.2 |       90.3 |  85.5 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) | v3.3.0        | 97.4 |       90.1 |  85.3 |
| [`en_core_web_lg`](/models/en#en_core_web_lg) | v3.4.0        | 97.3 |       90.2 |  85.6 |
|
||||
|
||||
## Notes about upgrading from v3.3 {#upgrading}
|
||||
|
||||
### Doc.has_vector
|
||||
|
||||
`Doc.has_vector` now matches `Token.has_vector` and `Span.has_vector`: it
|
||||
returns `True` if at least one token in the doc has a vector rather than
|
||||
checking only whether the vocab contains vectors.
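
The changed behavior described above can be checked with a small sketch. It assumes spaCy v3.4 or later and uses a blank English pipeline with a single made-up 4-dimensional vector for the word "apple"; the words and vector values are purely illustrative.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Give exactly one word a vector so that the vocab contains vectors,
# but most tokens still have none. The values are arbitrary.
nlp.vocab.set_vector("apple", numpy.ones((4,), dtype="float32"))

doc_with = nlp("apple pie")      # one token ("apple") has a vector
doc_without = nlp("pie crust")   # no token has a vector

# Under the v3.4 behavior, has_vector reflects the tokens in each doc
# instead of only checking whether the vocab contains any vectors.
with_flag = doc_with.has_vector
without_flag = doc_without.has_vector
```

Under the earlier behavior both docs would have reported `True`, because the check only looked at whether the vocab contained vectors.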
|
||||
|
||||
### Using trained pipelines with floret vectors
|
||||
|
||||
If you're using a trained pipeline for Croatian, Finnish, Korean or Swedish with
|
||||
new texts and working with `Doc` objects, you shouldn't notice any difference
|
||||
between floret vectors and default vectors.
|
||||
|
||||
If you use vectors for similarity comparisons, there are a few differences,
|
||||
mainly because a floret pipeline doesn't include any kind of frequency-based
|
||||
word list similar to the list of in-vocabulary vector keys with default vectors.
|
||||
|
||||
- If your workflow iterates over the vector keys, you should use an external
|
||||
word list instead:
|
||||
|
||||
```diff
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ lexemes = [nlp.vocab[word] for word in external_word_list]
```
|
||||
|
||||
- `Vectors.most_similar` is not supported because there's no fixed list of
|
||||
vectors to compare your vectors to.
|
||||
|
||||
### Pipeline package version compatibility {#version-compat}
|
||||
|
||||
> #### Using legacy implementations
|
||||
>
|
||||
> In spaCy v3, you'll still be able to load and reference legacy implementations
|
||||
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
|
||||
> components or architectures change and newer versions are available in the
|
||||
> core library.
|
||||
|
||||
When you're loading a pipeline package trained with an earlier version of spaCy
|
||||
v3, you will see a warning telling you that the pipeline may be incompatible.
|
||||
This doesn't necessarily have to be true, but we recommend running your
|
||||
pipelines against your test suite or evaluation data to make sure there are no
|
||||
unexpected results.
|
||||
|
||||
If you're using one of the [trained pipelines](/models) we provide, you should
|
||||
run [`spacy download`](/api/cli#download) to update to the latest version. To
|
||||
see an overview of all installed packages and their compatibility, you can run
|
||||
[`spacy validate`](/api/cli#validate).
|
||||
|
||||
If you've trained your own custom pipeline and you've confirmed that it's still
|
||||
working as expected, you can update the spaCy version requirements in the
|
||||
[`meta.json`](/api/data-formats#meta):
|
||||
|
||||
```diff
- "spacy_version": ">=3.3.0,<3.4.0",
+ "spacy_version": ">=3.3.0,<3.5.0",
```
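
If you'd rather script the edit shown above than change the file by hand, something like the following sketch could work. The helper name `widen_spacy_version` and the package name `my_pipeline` are hypothetical, and the example writes to a temporary `meta.json` instead of touching a real pipeline package.

```python
import json
import tempfile
from pathlib import Path


def widen_spacy_version(meta_path: Path, new_range: str) -> dict:
    """Rewrite the "spacy_version" field of a meta.json (hypothetical helper)."""
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = new_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demonstrate on a throwaway file; "my_pipeline" is a made-up package name.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "meta.json"
    meta_path.write_text(
        json.dumps({"name": "my_pipeline", "spacy_version": ">=3.3.0,<3.4.0"}),
        encoding="utf8",
    )
    updated = widen_spacy_version(meta_path, ">=3.3.0,<3.5.0")
    reloaded = json.loads(meta_path.read_text(encoding="utf8"))
```

Only the version range changes; all other fields in the file are written back as they were read.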
|
||||
|
||||
### Updating v3.3 configs
|
||||
|
||||
To update a config from spaCy v3.3 with the new v3.4 settings, run
|
||||
[`init fill-config`](/api/cli#init-fill-config):
|
||||
|
||||
```cli
$ python -m spacy init fill-config config-v3.3.cfg config-v3.4.cfg
```
|
||||
|
||||
In many cases ([`spacy train`](/api/cli#train),
|
||||
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
|
||||
automatically, but you'll need to fill in the new settings to run
|
||||
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
|
|
@ -162,7 +162,12 @@
|
|||
{
|
||||
"code": "hr",
|
||||
"name": "Croatian",
|
||||
- "has_examples": true
+ "has_examples": true,
+ "models": [
+ "hr_core_news_sm",
+ "hr_core_news_md",
+ "hr_core_news_lg"
+ ]
|
||||
},
|
||||
{
|
||||
"code": "hsb",
|
||||
|
|
|
@ -12,7 +12,9 @@
|
|||
{ "text": "New in v3.0", "url": "/usage/v3" },
|
||||
{ "text": "New in v3.1", "url": "/usage/v3-1" },
|
||||
- { "text": "New in v3.2", "url": "/usage/v3-2" },
- { "text": "New in v3.3", "url": "/usage/v3-3" }
+ { "text": "New in v3.2", "url": "/usage/v3-2" },
+ { "text": "New in v3.3", "url": "/usage/v3-3" },
+ { "text": "New in v3.4", "url": "/usage/v3-4" }
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -120,8 +120,8 @@ const AlertSpace = ({ nightly, legacy }) => {
|
|||
}
|
||||
|
||||
const navAlert = (
|
||||
- <Link to="/usage/v3-3" hidden>
- <strong>💥 Out now:</strong> spaCy v3.3
+ <Link to="/usage/v3-4" hidden>
+ <strong>💥 Out now:</strong> spaCy v3.4
|
||||
</Link>
|
||||
)
|
||||
|
||||
|
|