mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 18:26:30 +03:00
Docs for v3.3 (#10628)
* Temporarily disable CI tests
* Start v3.3 website updates
* Add trainable lemmatizer to pipeline design
* Fix Vectors.most_similar
* Add floret vector info to pipeline design
* Add Lower and Upper Sorbian
* Add span to sidebar
* Work on release notes
* Copy from release notes
* Update pipeline design graphic
* Upgrading note about Doc.from_docs
* Add tables and details
* Update website/docs/models/index.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix da lemma acc
* Add minimal intro, various updates
* Round lemma acc
* Add section on floret / word lists
* Add new pipelines table, minor edits
* Fix displacy spans example title
* Clarify adding non-trainable lemmatizer
* Update adding-languages URLs
* Revert "Temporarily disable CI tests"
This reverts commit 1dee505920.
* Spell out words/sec
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
parent
10377fb945
commit
497a708c71
|
@ -621,7 +621,7 @@ relative clauses.
|
|||
|
||||
To customize the noun chunk iterator in a loaded pipeline, modify
|
||||
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
|
||||
[syntax iterator](/usage/adding-languages#language-data) has not been
|
||||
[syntax iterator](/usage/linguistic-features#language-data) has not been
|
||||
implemented for the given language, a `NotImplementedError` is raised.
|
||||
|
||||
> #### Example
|
||||
|
|
|
@ -283,8 +283,9 @@ objects, if the document has been syntactically parsed. A base noun phrase, or
|
|||
it – so no NP-level coordination, no prepositional phrases, and no relative
|
||||
clauses.
|
||||
|
||||
If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has
|
||||
not been implemeted for the given language, a `NotImplementedError` is raised.
|
||||
If the `noun_chunk` [syntax iterator](/usage/linguistic-features#language-data)
|
||||
has not been implemented for the given language, a `NotImplementedError` is
|
||||
raised.
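
As a rough sketch of how calling code might guard against this error, the hypothetical helper `safe_noun_chunks` below falls back to an empty list. It assumes that the blank multi-language pipeline (`xx`) ships no noun chunk iterator; that is an assumption made for illustration, not something stated in these docs.

```python
import spacy


def safe_noun_chunks(doc):
    """Return the doc's noun chunks, or [] if the language has no iterator."""
    try:
        return list(doc.noun_chunks)
    except NotImplementedError:
        # No `noun_chunk` syntax iterator is implemented for this language.
        return []


# Assumption: the multi-language class ("xx") defines no syntax iterators.
nlp = spacy.blank("xx")
chunks = safe_noun_chunks(nlp("A sentence without a parse."))
```

For languages that do implement the iterator, the document still needs a dependency parse before noun chunks can be computed.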
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -520,12 +521,13 @@ sent = doc[sent.start : max(sent.end, span.end)]
|
|||
|
||||
## Span.sents {#sents tag="property" model="sentences" new="3.2.1"}
|
||||
|
||||
Returns a generator over the sentences the span belongs to. This property is only available
|
||||
when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
|
||||
document by the `parser`, `senter`, `sentencizer` or some custom function. It
|
||||
will raise an error otherwise.
|
||||
Returns a generator over the sentences the span belongs to. This property is
|
||||
only available when [sentence boundaries](/usage/linguistic-features#sbd) have
|
||||
been set on the document by the `parser`, `senter`, `sentencizer` or some custom
|
||||
function. It will raise an error otherwise.
|
||||
|
||||
If the span happens to cross sentence boundaries, all sentences the span overlaps with will be returned.
|
||||
If the span happens to cross sentence boundaries, all sentences the span
|
||||
overlaps with will be returned.
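
A minimal sketch of the overlapping case, assuming a blank English pipeline with the rule-based `sentencizer` as the source of sentence boundaries (any of the components named above would do):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sets sentence boundaries on the doc
doc = nlp("This is one. This is two.")

# Tokens 2-5 ("one. This is") start in the first sentence and end in the second.
span = doc[2:6]
sents = list(span.sents)
```

Here `sents` has one entry per sentence the span touches, so the span above produces two.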
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -347,14 +347,14 @@ supported for `floret` mode.
|
|||
> most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
|
||||
| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
|
||||
| _keyword-only_ | |
|
||||
| `batch_size` | The batch size to use. Default to `1024`. ~~int~~ |
|
||||
| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
|
||||
| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
|
||||
| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
|
||||
| Name | Description |
|
||||
| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
|
||||
| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
|
||||
| _keyword-only_ | |
|
||||
| `batch_size` | The batch size to use. Defaults to `1024`. ~~int~~ |
|
||||
| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
|
||||
| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
|
||||
| **RETURNS** | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
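
To make the return value concrete, here is a small sketch with a toy table of three 2-dimensional vectors. The keys and values are invented for illustration; only the call and the shape of the returned tuple follow the table above.

```python
import numpy
from spacy.vectors import Vectors

# Toy vector table: three made-up 2-dimensional vectors.
data = numpy.asarray([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog", "pet"])

# One query vector; ask for the two most similar rows.
queries = numpy.asarray([[0.9, 0.1]], dtype="f")
keys, best_rows, scores = vectors.most_similar(queries, n=2)
```

Each of the three arrays has one row per query and `n` columns, so all of them have shape `(1, 2)` here.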
|
||||
|
||||
## Vectors.get_batch {#get_batch tag="method" new="3.2"}
|
||||
|
||||
|
|
File diff suppressed because one or more lines are too long
Image size before: 27 KiB. Image size after: 108 KiB.
|
@ -30,10 +30,16 @@ into three components:
|
|||
tagging, parsing, lemmatization and named entity recognition, or `dep` for
|
||||
only tagging, parsing and lemmatization).
|
||||
2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
|
||||
3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf` (`sm`: no word
|
||||
vectors, `md`: reduced word vector table with 20k unique vectors for ~500k
|
||||
words, `lg`: large word vector table with ~500k entries, `trf`: transformer
|
||||
pipeline without static word vectors)
|
||||
3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.
|
||||
|
||||
`sm` and `trf` pipelines have no static word vectors.
|
||||
|
||||
For pipelines with default vectors, `md` has a reduced word vector table with
|
||||
20k unique vectors for ~500k words and `lg` has a large word vector table
|
||||
with ~500k entries.
|
||||
|
||||
For pipelines with floret vectors, `md` vector tables have 50k entries and
|
||||
`lg` vector tables have 200k entries.
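
The naming scheme can be sketched in a few lines. `parse_package_name` is a hypothetical helper, not part of spaCy, and it assumes every package name follows the `[lang]_[type]_[genre]_[size]` pattern:

```python
def parse_package_name(name):
    """Split a package name like 'en_core_web_sm' into its four parts."""
    lang, pipeline_type, genre, size = name.split("_")
    return {"lang": lang, "type": pipeline_type, "genre": genre, "size": size}


parts = parse_package_name("en_core_web_sm")

# Per the size notes above, only `md` and `lg` packages ship static vectors.
has_static_vectors = parts["size"] in ("md", "lg")
```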
|
||||
|
||||
For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
|
||||
pipeline trained on written web text (blogs, news, comments), that includes
|
||||
|
@ -90,19 +96,42 @@ Main changes from spaCy v2 models:
|
|||
In the `sm`/`md`/`lg` models:
|
||||
|
||||
- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
|
||||
component.
|
||||
component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
|
||||
to `tok2vec`.
|
||||
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
|
||||
`morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
|
||||
tagged consistently and copies `token.pos` to `token.tag` if there is no
|
||||
tagger. For English, the attribute ruler can improve its mapping from
|
||||
`token.tag` to `token.pos` if dependency parses from a `parser` are present,
|
||||
but the parser is not required.
|
||||
- The `lemmatizer` component for many languages (Catalan, Dutch, English,
|
||||
French, Greek, Italian Macedonian, Norwegian, Polish and Spanish) requires
|
||||
`token.pos` annotation from either `tagger`+`attribute_ruler` or
|
||||
`morphologizer`.
|
||||
- The `lemmatizer` component for many languages requires `token.pos` annotation
|
||||
from either `tagger`+`attribute_ruler` or `morphologizer`.
|
||||
- The `ner` component is independent with its own internal tok2vec layer.
|
||||
|
||||
#### CNN/CPU pipelines with floret vectors
|
||||
|
||||
The Finnish, Korean and Swedish `md` and `lg` pipelines use
|
||||
[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
|
||||
running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
|
||||
you shouldn't notice any difference with floret vectors. With floret vectors no
|
||||
tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
|
||||
return `False` for all tokens.
|
||||
|
||||
If you access vectors directly for similarity comparisons, there are a few
|
||||
differences because floret vectors don't include a fixed word list like the
|
||||
vector keys for default vectors.
|
||||
|
||||
- If your workflow iterates over the vector keys, you need to use an external
|
||||
word list instead:
|
||||
|
||||
```diff
|
||||
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
|
||||
+ lexemes = [nlp.vocab[word] for word in external_word_list]
|
||||
```
|
||||
|
||||
- [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
|
||||
there's no fixed list of vectors to compare your vectors to.
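
If the same code has to run on both kinds of pipelines, one option is to check the vectors mode first. This is a sketch only: `can_use_most_similar` is a hypothetical helper, and the blank pipeline stands in for whichever trained pipeline you actually load.

```python
import spacy


def can_use_most_similar(nlp):
    """Return False for floret vectors, which have no fixed list of keys."""
    return nlp.vocab.vectors.mode != "floret"


# Placeholder pipeline; a blank pipeline uses the default vectors mode.
nlp = spacy.blank("en")
supported = can_use_most_similar(nlp)
```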
|
||||
|
||||
### Transformer pipeline design {#design-trf}
|
||||
|
||||
In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
|
||||
|
@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma
|
|||
<Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
|
||||
Token.pos">
|
||||
|
||||
The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
|
||||
Catalan, Dutch, English, French, Greek, Italian, Macedonian, Norwegian, Polish
|
||||
and Spanish. If you disable any of these components, you'll see lemmatizer
|
||||
warnings unless the lemmatizer is also disabled.
|
||||
The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
|
||||
number of languages. If you disable any of these components, you'll see
|
||||
lemmatizer warnings unless the lemmatizer is also disabled.
|
||||
|
||||
**v3.3**: Catalan, English, French, Russian and Spanish
|
||||
|
||||
**v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
|
||||
Norwegian, Polish, Russian and Spanish
|
||||
|
||||
</Infobox>
|
||||
|
||||
|
@ -154,10 +187,34 @@ nlp.enable_pipe("senter")
|
|||
The `senter` component is ~10× faster than the parser and more accurate
|
||||
than the rule-based `sentencizer`.
|
||||
|
||||
#### Switch from trainable lemmatizer to default lemmatizer
|
||||
|
||||
Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether
|
||||
the lemmatizer is trainable:
|
||||
|
||||
```python
|
||||
import spacy

nlp = spacy.load("de_core_news_sm")
|
||||
assert nlp.get_pipe("lemmatizer").is_trainable
|
||||
```
|
||||
|
||||
If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
|
||||
earlier, you can replace the trainable lemmatizer with the default non-trainable
|
||||
lemmatizer:
|
||||
|
||||
```python
|
||||
# Requirements: pip install spacy-lookups-data
|
||||
nlp = spacy.load("de_core_news_sm")
|
||||
# Remove existing lemmatizer
|
||||
nlp.remove_pipe("lemmatizer")
|
||||
# Add non-trainable lemmatizer from language defaults
|
||||
# and load lemmatizer tables from spacy-lookups-data
|
||||
nlp.add_pipe("lemmatizer").initialize()
|
||||
```
|
||||
|
||||
#### Switch from rule-based to lookup lemmatization
|
||||
|
||||
For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
|
||||
pipelines, you can switch from the default rule-based lemmatizer to a lookup
|
||||
pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
|
||||
lemmatizer:
|
||||
|
||||
```python
|
||||
|
|
247
website/docs/usage/v3-3.md
Normal file
|
@ -0,0 +1,247 @@
|
|||
---
|
||||
title: What's New in v3.3
|
||||
teaser: New features and how to upgrade
|
||||
menu:
|
||||
- ['New Features', 'features']
|
||||
- ['Upgrading Notes', 'upgrading']
|
||||
---
|
||||
|
||||
## New features {#features hidden="true"}
|
||||
|
||||
spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
|
||||
lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
|
||||
|
||||
### Speed improvements {#speed}
|
||||
|
||||
v3.3 includes a slew of speed improvements:
|
||||
|
||||
- Speed up parser and NER by using constant-time head lookups.
|
||||
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
|
||||
inference for tagger, morphologizer, senter and trainable lemmatizer.
|
||||
- Speed up parser projectivization functions.
|
||||
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
|
||||
- Improve `Matcher` speed.
|
||||
- Improve serialization speed for empty `Doc.spans`.
|
||||
|
||||
For longer texts, the trained pipeline speeds improve **15%** or more in
|
||||
prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and
|
||||
`de_core_news_md` (with the new trainable lemmatizer) across a range of text
|
||||
sizes on Linux (Intel Xeon W-2265) and macOS (M1) to compare spaCy v3.2 vs. v3.3:
|
||||
|
||||
**Intel Xeon W-2265**
|
||||
|
||||
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
|
||||
| :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
|
||||
| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 17292 | 17441 | 0.86% |
|
||||
| (=same components) | 1000 | 15408 | 16024 | 4.00% |
|
||||
| | 10000 | 12798 | 15346 | 19.91% |
|
||||
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 20221 | 19321 | -4.45% |
|
||||
| (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
|
||||
| | 10000 | 14513 | 17036 | 17.38% |
|
||||
|
||||
**Apple M1**
|
||||
|
||||
| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
|
||||
| ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
|
||||
| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 18272 | 18408 | 0.74% |
|
||||
| (=same components) | 1000 | 18794 | 19248 | 2.42% |
|
||||
| | 10000 | 15144 | 17513 | 15.64% |
|
||||
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 19227 | 19591 | 1.89% |
|
||||
| (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
|
||||
| | 10000 | 15921 | 18546 | 16.49% |
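
The `Diff` column appears to be the relative change in words per second from v3.2 to v3.3. A quick check against two rows of the Intel Xeon table, using a hypothetical helper `percent_diff`:

```python
def percent_diff(old_wps, new_wps):
    """Relative change in words per second, in percent."""
    return (new_wps - old_wps) / old_wps * 100


# en_core_web_md on Intel Xeon W-2265, 100 and 10000 words per doc.
short_docs = round(percent_diff(17292, 17441), 2)  # 0.86
long_docs = round(percent_diff(12798, 15346), 2)  # 19.91
```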
|
||||
|
||||
### Trainable lemmatizer {#trainable-lemmatizer}
|
||||
|
||||
The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
|
||||
[edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
|
||||
into lemmas. Try out the trainable lemmatizer with the
|
||||
[training quickstart](/usage/training#quickstart)!
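
For a quick look outside the quickstart, the component can also be added to a blank pipeline. The sketch below assumes the factory is registered as `trainable_lemmatizer`. The component still has to be trained before it predicts useful lemmas.

```python
import spacy

nlp = spacy.blank("de")

# Add the (untrained) edit tree lemmatizer under the usual component name.
lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
trainable = lemmatizer.is_trainable
```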
|
||||
|
||||
### displaCy support for overlapping spans and arcs {#displacy}
|
||||
|
||||
displaCy now supports overlapping spans with a new
|
||||
[`span`](/usage/visualizers#span) style and multiple arcs with different labels
|
||||
between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.
|
||||
|
||||
Overlapping spans can be visualized for any spans key in `doc.spans`:
|
||||
|
||||
```python
|
||||
import spacy
|
||||
from spacy import displacy
|
||||
from spacy.tokens import Span
|
||||
|
||||
nlp = spacy.blank("en")
|
||||
text = "Welcome to the Bank of China."
|
||||
doc = nlp(text)
|
||||
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
|
||||
displacy.serve(doc, style="span", options={"spans_key": "custom"})
|
||||
```
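
`displacy.serve` starts a web server and blocks until it is stopped. If you only need the markup, for instance to write it to a file, a variant of the example above using `displacy.render` might look like this (`jupyter=False` is passed here to force a string return value):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]

# Returns the HTML markup as a string instead of serving it.
html = displacy.render(
    doc, style="span", options={"spans_key": "custom"}, jupyter=False
)
```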
|
||||
|
||||
import DisplacySpanHtml from 'images/displacy-span.html'
|
||||
|
||||
<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />
|
||||
|
||||
## Additional features and improvements
|
||||
|
||||
- Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
|
||||
- Span suggester debugging with
|
||||
[`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
|
||||
- Big endian support with
|
||||
[`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
|
||||
updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
|
||||
- Initial support for Lower Sorbian and Upper Sorbian.
|
||||
- Language updates for English, French, Italian, Japanese, Korean, Norwegian,
|
||||
Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
|
||||
- New noun chunks for Finnish.
|
||||
|
||||
## Trained pipelines {#pipelines}
|
||||
|
||||
### New trained pipelines {#new-pipelines}
|
||||
|
||||
v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
|
||||
the new trainable lemmatizer and
|
||||
[floret vectors](https://github.com/explosion/floret). Due to the use of
|
||||
[Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
|
||||
pipelines have compact vectors with no out-of-vocabulary words.
|
||||
|
||||
| Package | Language | UPOS | Parser LAS | NER F |
|
||||
| ----------------------------------------------- | -------- | ---: | ---------: | ----: |
|
||||
| [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish | 92.5 | 71.9 | 75.9 |
|
||||
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish | 95.9 | 78.6 | 80.6 |
|
||||
| [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish | 96.2 | 79.4 | 82.4 |
|
||||
| [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean | 86.1 | 65.6 | 71.3 |
|
||||
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean | 94.7 | 80.9 | 83.1 |
|
||||
| [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean | 94.7 | 81.3 | 85.3 |
|
||||
| [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish | 95.0 | 75.9 | 74.7 |
|
||||
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish | 96.3 | 78.5 | 79.3 |
|
||||
| [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish | 96.3 | 79.1 | 81.1 |
|
||||
|
||||
### Pipeline updates {#pipeline-updates}
|
||||
|
||||
The following languages switch from lookup or rule-based lemmatizers to the new
|
||||
trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
|
||||
Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
|
||||
improves for all of these pipelines, but be aware that the types of errors may
|
||||
look quite different from the lookup-based lemmatizers. If you'd prefer to
|
||||
continue using the previous lemmatizer, you can
|
||||
[switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).
|
||||
|
||||
<figure>
|
||||
|
||||
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
|
||||
| ----------------------------------------------- | -------------: | -------------: |
|
||||
| [`da_core_news_md`](/models/da#da_core_news_md) | 84.9 | 94.8 |
|
||||
| [`de_core_news_md`](/models/de#de_core_news_md) | 73.4 | 97.7 |
|
||||
| [`el_core_news_md`](/models/el#el_core_news_md) | 56.5 | 88.9 |
|
||||
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | - | 86.2 |
|
||||
| [`it_core_news_md`](/models/it#it_core_news_md) | 86.6 | 97.2 |
|
||||
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | - | 90.0 |
|
||||
| [`lt_core_news_md`](/models/lt#lt_core_news_md) | 71.1 | 84.8 |
|
||||
| [`nb_core_news_md`](/models/nb#nb_core_news_md) | 76.7 | 97.1 |
|
||||
| [`nl_core_news_md`](/models/nl#nl_core_news_md) | 81.5 | 94.0 |
|
||||
| [`pl_core_news_md`](/models/pl#pl_core_news_md) | 87.1 | 93.7 |
|
||||
| [`pt_core_news_md`](/models/pt#pt_core_news_md) | 76.7 | 96.9 |
|
||||
| [`ro_core_news_md`](/models/ro#ro_core_news_md) | 81.8 | 95.5 |
|
||||
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | - | 95.5 |
|
||||
|
||||
</figure>
|
||||
|
||||
In addition, the vectors in the English pipelines are deduplicated to improve
|
||||
the pruned vectors in the `md` models and reduce the `lg` model size.
|
||||
|
||||
## Notes about upgrading from v3.2 {#upgrading}
|
||||
|
||||
### Span comparisons
|
||||
|
||||
Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
|
||||
attributes into account (start, end, label, and KB ID) so spans may be sorted in
|
||||
a slightly different order.
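
If your code depends on one particular order, sorting with an explicit key is one way to keep it independent of the comparison rules. The key below (start, end, then label string) is just one possible choice:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")
spans = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE"), Span(doc, 3, 6, "FAC")]

# Sort by position first, then alphabetically by label.
ordered = sorted(spans, key=lambda s: (s.start, s.end, s.label_))
labels = [s.label_ for s in ordered]
```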
|
||||
|
||||
### Whitespace annotation
|
||||
|
||||
During training, annotation on whitespace tokens is handled in the same way as
|
||||
annotation on non-whitespace tokens in order to allow custom whitespace
|
||||
annotation.
|
||||
|
||||
### Doc.from_docs
|
||||
|
||||
[`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
|
||||
supports excludes with an `exclude` argument in the same format as
|
||||
`Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and
|
||||
`user_data`.
|
||||
|
||||
Docs that include `Doc.tensor` may use considerably more RAM. To exclude
`Doc.tensor` as in v3.2:
|
||||
|
||||
```diff
|
||||
-merged_doc = Doc.from_docs(docs)
|
||||
+merged_doc = Doc.from_docs(docs, exclude=["tensor"])
|
||||
```
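
Put together as a runnable sketch, with a blank English pipeline and two made-up texts standing in for real data:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
docs = [nlp("Hello world."), nlp("Second doc here.")]

# Merge the docs without copying over `Doc.tensor`.
merged_doc = Doc.from_docs(docs, exclude=["tensor"])
```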
|
||||
|
||||
### Using trained pipelines with floret vectors
|
||||
|
||||
If you're running a new trained pipeline for Finnish, Korean or Swedish on new
|
||||
texts and working with `Doc` objects, you shouldn't notice any difference with
|
||||
floret vectors vs. default vectors.
|
||||
|
||||
If you use vectors for similarity comparisons, there are a few differences,
|
||||
mainly because a floret pipeline doesn't include any kind of frequency-based
|
||||
word list similar to the list of in-vocabulary vector keys with default vectors.
|
||||
|
||||
- If your workflow iterates over the vector keys, you should use an external
|
||||
word list instead:
|
||||
|
||||
```diff
|
||||
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
|
||||
+ lexemes = [nlp.vocab[word] for word in external_word_list]
|
||||
```
|
||||
|
||||
- `Vectors.most_similar` is not supported because there's no fixed list of
|
||||
vectors to compare your vectors to.
|
||||
|
||||
### Pipeline package version compatibility {#version-compat}
|
||||
|
||||
> #### Using legacy implementations
|
||||
>
|
||||
> In spaCy v3, you'll still be able to load and reference legacy implementations
|
||||
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
|
||||
> components or architectures change and newer versions are available in the
|
||||
> core library.
|
||||
|
||||
When you're loading a pipeline package trained with an earlier version of spaCy
|
||||
v3, you will see a warning telling you that the pipeline may be incompatible.
|
||||
This doesn't necessarily have to be true, but we recommend running your
|
||||
pipelines against your test suite or evaluation data to make sure there are no
|
||||
unexpected results.
|
||||
|
||||
If you're using one of the [trained pipelines](/models) we provide, you should
|
||||
run [`spacy download`](/api/cli#download) to update to the latest version. To
|
||||
see an overview of all installed packages and their compatibility, you can run
|
||||
[`spacy validate`](/api/cli#validate).
|
||||
|
||||
If you've trained your own custom pipeline and you've confirmed that it's still
|
||||
working as expected, you can update the spaCy version requirements in the
|
||||
[`meta.json`](/api/data-formats#meta):
|
||||
|
||||
```diff
|
||||
- "spacy_version": ">=3.2.0,<3.3.0",
|
||||
+ "spacy_version": ">=3.2.0,<3.4.0",
|
||||
```
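
To see what the changed requirement means for a given spaCy version, you can evaluate both constraints with `spacy.util.is_compatible_version`. The version string below is an example value:

```python
from spacy.util import is_compatible_version

spacy_version = "3.3.0"  # example value; use spacy.__version__ in practice

old_ok = is_compatible_version(spacy_version, ">=3.2.0,<3.3.0")
new_ok = is_compatible_version(spacy_version, ">=3.2.0,<3.4.0")
```

With the old constraint, v3.3.0 falls outside the range; with the updated one, it is accepted.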
|
||||
|
||||
### Updating v3.2 configs
|
||||
|
||||
To update a config from spaCy v3.2 with the new v3.3 settings, run
|
||||
[`init fill-config`](/api/cli#init-fill-config):
|
||||
|
||||
```cli
|
||||
$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
|
||||
```
|
||||
|
||||
In many cases ([`spacy train`](/api/cli#train),
|
||||
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
|
||||
automatically, but you'll need to fill in the new settings to run
|
||||
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
|
||||
|
||||
To see the speed improvements for the
|
||||
[`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
|
||||
from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
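
The architecture switch is a one-line change in the config file. For illustration, the sketch below makes the same edit programmatically with Thinc's `Config` class. The config string is a made-up fragment, not a complete training config.

```python
from thinc.api import Config

# Hypothetical fragment of a v3.2 config.
CONFIG_STR = """
[components]

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
"""

config = Config().from_str(CONFIG_STR)
model_cfg = config["components"]["tagger"]["model"]
model_cfg["@architectures"] = "spacy.Tagger.v2"
updated = config.to_str()
```

Running `init fill-config` on the edited file afterwards fills in any settings the new architecture adds.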
|
|
@ -5,6 +5,7 @@ new: 2
|
|||
menu:
|
||||
- ['Dependencies', 'dep']
|
||||
- ['Named Entities', 'ent']
|
||||
- ['Spans', 'span']
|
||||
- ['Jupyter Notebooks', 'jupyter']
|
||||
- ['Rendering HTML', 'html']
|
||||
- ['Web app usage', 'webapp']
|
||||
|
@ -192,7 +193,7 @@ displacy.serve(doc, style="span")
|
|||
|
||||
import DisplacySpanHtml from 'images/displacy-span.html'
|
||||
|
||||
<Iframe title="displaCy visualizer for entities" html={DisplacySpanHtml} height={180} />
|
||||
<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />
|
||||
|
||||
|
||||
The span visualizer lets you customize the following `options`:
|
||||
|
|
|
@ -62,6 +62,11 @@
|
|||
"example": "Dies ist ein Satz.",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "dsb",
|
||||
"name": "Lower Sorbian",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "el",
|
||||
"name": "Greek",
|
||||
|
@ -159,6 +164,11 @@
|
|||
"name": "Croatian",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "hsb",
|
||||
"name": "Upper Sorbian",
|
||||
"has_examples": true
|
||||
},
|
||||
{
|
||||
"code": "hu",
|
||||
"name": "Hungarian",
|
||||
|
|
|
@ -11,7 +11,8 @@
|
|||
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
|
||||
{ "text": "New in v3.0", "url": "/usage/v3" },
|
||||
{ "text": "New in v3.1", "url": "/usage/v3-1" },
|
||||
{ "text": "New in v3.2", "url": "/usage/v3-2" }
|
||||
{ "text": "New in v3.2", "url": "/usage/v3-2" },
|
||||
{ "text": "New in v3.3", "url": "/usage/v3-3" }
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -120,8 +120,8 @@ const AlertSpace = ({ nightly, legacy }) => {
|
|||
}
|
||||
|
||||
const navAlert = (
|
||||
<Link to="/usage/v3-2" hidden>
|
||||
<strong>💥 Out now:</strong> spaCy v3.2
|
||||
<Link to="/usage/v3-3" hidden>
|
||||
<strong>💥 Out now:</strong> spaCy v3.3
|
||||
</Link>
|
||||
)
|
||||
|
||||
|
|