mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-25 17:36:30 +03:00
Docs for v3.3 (#10628)
* Temporarily disable CI tests
* Start v3.3 website updates
* Add trainable lemmatizer to pipeline design
* Fix Vectors.most_similar
* Add floret vector info to pipeline design
* Add Lower and Upper Sorbian
* Add span to sidebar
* Work on release notes
* Copy from release notes
* Update pipeline design graphic
* Upgrading note about Doc.from_docs
* Add tables and details
* Update website/docs/models/index.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix da lemma acc
* Add minimal intro, various updates
* Round lemma acc
* Add section on floret / word lists
* Add new pipelines table, minor edits
* Fix displacy spans example title
* Clarify adding non-trainable lemmatizer
* Update adding-languages URLs
* Revert "Temporarily disable CI tests"
This reverts commit 1dee505920.
* Spell out words/sec
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
This commit is contained in: parent 10377fb945, commit 497a708c71
@@ -621,7 +621,7 @@ relative clauses.

To customize the noun chunk iterator in a loaded pipeline, modify
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
[syntax iterator](/usage/linguistic-features#language-data) has not been
implemented for the given language, a `NotImplementedError` is raised.

> #### Example
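The syntax iterator contract is a callable that receives a `Doc` or `Span` and yields `(start, end, label)` triples. A toy sketch of that contract, using a stand-in token list instead of a real `Doc` (the single-noun heuristic is purely illustrative, not spaCy's chunking logic):

```python
# Stand-in for a Doc: a list of (text, pos) pairs. A real syntax
# iterator receives a Doc or Span and yields (start, end, label) triples.
def noun_chunks(tokens, label="NP"):
    for i, (text, pos) in enumerate(tokens):
        # Toy heuristic: each noun is its own single-token chunk.
        if pos in ("NOUN", "PROPN"):
            yield i, i + 1, label

tokens = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB")]
print(list(noun_chunks(tokens)))  # [(1, 2, 'NP')]
```

A real replacement assigned to `nlp.vocab.get_noun_chunks` would follow the same shape, yielding offsets into the `Doc` plus a label hash.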
@@ -283,8 +283,9 @@ objects, if the document has been syntactically parsed. A base noun phrase, or

it – so no NP-level coordination, no prepositional phrases, and no relative
clauses.

If the `noun_chunk` [syntax iterator](/usage/linguistic-features#language-data)
has not been implemented for the given language, a `NotImplementedError` is
raised.

> #### Example
>
@@ -520,12 +521,13 @@ sent = doc[sent.start : max(sent.end, span.end)]

## Span.sents {#sents tag="property" model="sentences" new="3.2.1"}

Returns a generator over the sentences the span belongs to. This property is
only available when [sentence boundaries](/usage/linguistic-features#sbd) have
been set on the document by the `parser`, `senter`, `sentencizer` or some custom
function. It will raise an error otherwise.

If the span happens to cross sentence boundaries, all sentences the span
overlaps with will be returned.

> #### Example
>
@@ -347,14 +347,14 @@ supported for `floret` mode.

> most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
> ```

| Name           | Description                                                                                                             |
| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
| `queries`      | An array with one or more vectors. ~~numpy.ndarray~~                                                                     |
| _keyword-only_ |                                                                                                                          |
| `batch_size`   | The batch size to use. Defaults to `1024`. ~~int~~                                                                       |
| `n`            | The number of entries to return for each query. Defaults to `1`. ~~int~~                                                 |
| `sort`         | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~                                              |
| **RETURNS**    | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~  |

## Vectors.get_batch {#get_batch tag="method" new="3.2"}
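For intuition, `most_similar` amounts to a batched cosine-similarity search over the rows of the vector table. A minimal numpy sketch of that computation (not spaCy's implementation, which additionally handles key lookup, batching and the `sort` flag):

```python
import numpy as np

def most_similar(table, queries, n=1):
    """Return (best_rows, scores): the n most cosine-similar table rows
    per query, sorted by descending score."""
    # Normalize rows so the dot product is the cosine similarity.
    t = table / np.linalg.norm(table, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    scores = q @ t.T                              # (n_queries, n_rows)
    best_rows = np.argsort(-scores, axis=1)[:, :n]
    return best_rows, np.take_along_axis(scores, best_rows, axis=1)

table = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.0]])
rows, scores = most_similar(table, queries, n=2)
print(rows[0])  # [0 2]
```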
File diff suppressed because one or more lines are too long
Image updated: size 27 KiB before, 108 KiB after.
@@ -30,10 +30,16 @@ into three components:

   tagging, parsing, lemmatization and named entity recognition, or `dep` for
   only tagging, parsing and lemmatization).
2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.

   `sm` and `trf` pipelines have no static word vectors.

   For pipelines with default vectors, `md` has a reduced word vector table with
   20k unique vectors for ~500k words and `lg` has a large word vector table
   with ~500k entries.

   For pipelines with floret vectors, `md` vector tables have 50k entries and
   `lg` vector tables have 200k entries.

For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
pipeline trained on written web text (blogs, news, comments), that includes
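Pipeline package names encode exactly these components after the language code. A quick sketch of how a name like `en_core_web_sm` breaks down (the helper is ours for illustration, not a spaCy API):

```python
def parse_pipeline_name(name):
    """Split a pipeline package name into its naming-convention parts:
    language code, type, genre and size, e.g. en_core_web_sm."""
    lang, type_, genre, size = name.split("_")
    return {"lang": lang, "type": type_, "genre": genre, "size": size}

print(parse_pipeline_name("en_core_web_sm"))
# {'lang': 'en', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```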
@@ -90,19 +96,42 @@ Main changes from spaCy v2 models:

In the `sm`/`md`/`lg` models:

- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
  component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
  to `tok2vec`.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
  `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
  tagged consistently and copies `token.pos` to `token.tag` if there is no
  tagger. For English, the attribute ruler can improve its mapping from
  `token.tag` to `token.pos` if dependency parses from a `parser` are present,
  but the parser is not required.
- The `lemmatizer` component for many languages requires `token.pos` annotation
  from either `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.

#### CNN/CPU pipelines with floret vectors

The Finnish, Korean and Swedish `md` and `lg` pipelines use
[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
you shouldn't notice any difference with floret vectors. With floret vectors no
tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
return `False` for all tokens.

If you access vectors directly for similarity comparisons, there are a few
differences because floret vectors don't include a fixed word list like the
vector keys for default vectors.

- If your workflow iterates over the vector keys, you need to use an external
  word list instead:

  ```diff
  - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
  + lexemes = [nlp.vocab[word] for word in external_word_list]
  ```

- [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
  there's no fixed list of vectors to compare your vectors to.

### Transformer pipeline design {#design-trf}

In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
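The reason there's no fixed word list: floret builds a vector for any string by hashing its character n-grams into a fixed-size table (Bloom embeddings). A toy sketch of that idea, where the hash function and n-gram settings are illustrative stand-ins rather than floret's actual scheme:

```python
import zlib

def subword_rows(word, n_buckets, minn=3, maxn=4):
    """Map each character n-gram of "<word>" to a row in a fixed-size
    table. Any string hashes to some rows, so nothing is out-of-vocabulary."""
    w = f"<{word}>"
    ngrams = [w[i : i + n] for n in range(minn, maxn + 1)
              for i in range(len(w) - n + 1)]
    return [zlib.crc32(g.encode("utf8")) % n_buckets for g in ngrams]

# Even a word never seen in training maps to valid table rows,
# whose embeddings would be summed to produce its vector:
rows = subword_rows("spaCy", n_buckets=1000)
print(all(0 <= r < 1000 for r in rows))  # True
```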
@@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma

<Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
Token.pos">

The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
number of languages. If you disable any of these components, you'll see
lemmatizer warnings unless the lemmatizer is also disabled.

**v3.3**: Catalan, English, French, Russian and Spanish

**v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
Norwegian, Polish, Russian and Spanish

</Infobox>
@@ -154,10 +187,34 @@ nlp.enable_pipe("senter")

The `senter` component is ~10× faster than the parser and more accurate
than the rule-based `sentencizer`.

#### Switch from trainable lemmatizer to default lemmatizer

Since v3.3, a number of pipelines use a trainable lemmatizer. You can check
whether the lemmatizer is trainable:

```python
nlp = spacy.load("de_core_news_sm")
assert nlp.get_pipe("lemmatizer").is_trainable
```

If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
earlier, you can replace the trainable lemmatizer with the default non-trainable
lemmatizer:

```python
# Requirements: pip install spacy-lookups-data
nlp = spacy.load("de_core_news_sm")
# Remove existing lemmatizer
nlp.remove_pipe("lemmatizer")
# Add non-trainable lemmatizer from language defaults
# and load lemmatizer tables from spacy-lookups-data
nlp.add_pipe("lemmatizer").initialize()
```

#### Switch from rule-based to lookup lemmatization

For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
lemmatizer:
247 website/docs/usage/v3-3.md (new file)

@@ -0,0 +1,247 @@

---
title: What's New in v3.3
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New features {#features hidden="true"}

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

### Speed improvements {#speed}

v3.3 includes a slew of speed improvements:

- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
  inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.

For longer texts, the trained pipeline speeds improve **15%** or more in
prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and
`de_core_news_md` (with the new trainable lemmatizer) across a range of text
sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:

**Intel Xeon W-2265**

| Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff   |
| :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md)    | 100            | 17292          | 17441          | 0.86%  |
| (=same components)                               | 1000           | 15408          | 16024          | 4.00%  |
|                                                  | 10000          | 12798          | 15346          | 19.91% |
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100            | 20221          | 19321          | -4.45% |
| (+v3.3 trainable lemmatizer)                     | 1000           | 17480          | 17345          | -0.77% |
|                                                  | 10000          | 14513          | 17036          | 17.38% |

**Apple M1**

| Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff   |
| ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md)    | 100            | 18272          | 18408          | 0.74%  |
| (=same components)                               | 1000           | 18794          | 19248          | 2.42%  |
|                                                  | 10000          | 15144          | 17513          | 15.64% |
| [`de_core_news_md`](/models/de/#de_core_news_md) | 100            | 19227          | 19591          | 1.89%  |
| (+v3.3 trainable lemmatizer)                     | 1000           | 20047          | 20628          | 2.90%  |
|                                                  | 10000          | 15921          | 18546          | 16.49% |
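The Diff column above is the relative throughput change from v3.2 to v3.3. As a quick sketch (the formula is our assumption about how the column was computed, and it reproduces the table values):

```python
def words_per_sec_diff(v32, v33):
    """Relative throughput change from v3.2 to v3.3, as a percentage."""
    return round((v33 - v32) / v32 * 100, 2)

# First and last Intel Xeon rows from the benchmark tables above:
print(words_per_sec_diff(17292, 17441))  # 0.86
print(words_per_sec_diff(14513, 17036))  # 17.38
```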
### Trainable lemmatizer {#trainable-lemmatizer}

The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
[edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
into lemmas. Try out the trainable lemmatizer with the
[training quickstart](/usage/training#quickstart)!
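Edit trees generalize learned suffix-rewrite rules. As a toy illustration of the underlying idea only (spaCy's edit trees compose keep/replace nodes over arbitrary substrings, not just suffixes):

```python
def apply_suffix_rule(token, old_suffix, new_suffix):
    """Toy stand-in for one learned lemmatization action: rewrite a
    word's suffix. Real edit trees are trees of such string edits."""
    if token.endswith(old_suffix):
        return token[: len(token) - len(old_suffix)] + new_suffix
    return token

print(apply_suffix_rule("walking", "ing", ""))  # walk
print(apply_suffix_rule("flies", "ies", "y"))   # fly
```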
### displaCy support for overlapping spans and arcs {#displacy}

displaCy now supports overlapping spans with a new
[`span`](/usage/visualizers#span) style and multiple arcs with different labels
between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.

Overlapping spans can be visualized for any spans key in `doc.spans`:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span", options={"spans_key": "custom"})
```

import DisplacySpanHtml from 'images/displacy-span.html'

<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />
## Additional features and improvements

- Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
- Span suggester debugging with
  [`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
- Big endian support with
  [`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
  updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
- Initial support for Lower Sorbian and Upper Sorbian.
- Language updates for English, French, Italian, Japanese, Korean, Norwegian,
  Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
- New noun chunks for Finnish.
## Trained pipelines {#pipelines}

### New trained pipelines {#new-pipelines}

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
the new trainable lemmatizer and
[floret vectors](https://github.com/explosion/floret). Due to the use of
[Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
pipelines have compact vectors with no out-of-vocabulary words.

| Package                                         | Language | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish  | 92.5 |       71.9 |  75.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish  | 95.9 |       78.6 |  80.6 |
| [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish  | 96.2 |       79.4 |  82.4 |
| [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean   | 86.1 |       65.6 |  71.3 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean   | 94.7 |       80.9 |  83.1 |
| [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean   | 94.7 |       81.3 |  85.3 |
| [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish  | 95.0 |       75.9 |  74.7 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish  | 96.3 |       78.5 |  79.3 |
| [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish  | 96.3 |       79.1 |  81.1 |
### Pipeline updates {#pipeline-updates}

The following languages switch from lookup or rule-based lemmatizers to the new
trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
improves for all of these pipelines, but be aware that the types of errors may
look quite different from the lookup-based lemmatizers. If you'd prefer to
continue using the previous lemmatizer, you can
[switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).

<figure>

| Model                                           | v3.2 Lemma Acc | v3.3 Lemma Acc |
| ----------------------------------------------- | -------------: | -------------: |
| [`da_core_news_md`](/models/da#da_core_news_md) |           84.9 |           94.8 |
| [`de_core_news_md`](/models/de#de_core_news_md) |           73.4 |           97.7 |
| [`el_core_news_md`](/models/el#el_core_news_md) |           56.5 |           88.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) |              - |           86.2 |
| [`it_core_news_md`](/models/it#it_core_news_md) |           86.6 |           97.2 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) |              - |           90.0 |
| [`lt_core_news_md`](/models/lt#lt_core_news_md) |           71.1 |           84.8 |
| [`nb_core_news_md`](/models/nb#nb_core_news_md) |           76.7 |           97.1 |
| [`nl_core_news_md`](/models/nl#nl_core_news_md) |           81.5 |           94.0 |
| [`pl_core_news_md`](/models/pl#pl_core_news_md) |           87.1 |           93.7 |
| [`pt_core_news_md`](/models/pt#pt_core_news_md) |           76.7 |           96.9 |
| [`ro_core_news_md`](/models/ro#ro_core_news_md) |           81.8 |           95.5 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) |              - |           95.5 |

</figure>

In addition, the vectors in the English pipelines are deduplicated to improve
the pruned vectors in the `md` models and reduce the `lg` model size.
## Notes about upgrading from v3.2 {#upgrading}

### Span comparisons

Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
attributes into account (start, end, label, and KB ID) so spans may be sorted in
a slightly different order.
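A sketch of the new ordering semantics, using plain attribute tuples as stand-ins for real `Span` objects (the tuple layout mirrors the attribute order described above):

```python
# Stand-in spans as (start, end, label, kb_id) tuples. Sorting compares
# all four fields in order, so two spans with identical offsets are now
# ordered by label (and then KB ID) instead of being tied.
spans = [
    (3, 6, "ORG", ""),
    (3, 6, "GPE", ""),
    (0, 2, "PERSON", ""),
]
spans.sort()
print(spans[1])  # (3, 6, 'GPE', '') - GPE sorts before ORG at equal offsets
```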
### Whitespace annotation

During training, annotation on whitespace tokens is handled in the same way as
annotation on non-whitespace tokens in order to allow custom whitespace
annotation.
### Doc.from_docs

[`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
supports excludes with an `exclude` argument in the same format as
`Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and
`user_data`.

Docs including `Doc.tensor` may be quite a bit larger in RAM, so to exclude
`Doc.tensor` as in v3.2:

```diff
-merged_doc = Doc.from_docs(docs)
+merged_doc = Doc.from_docs(docs, exclude=["tensor"])
```
### Using trained pipelines with floret vectors

If you're running a new trained pipeline for Finnish, Korean or Swedish on new
texts and working with `Doc` objects, you shouldn't notice any difference with
floret vectors vs. default vectors.

If you use vectors for similarity comparisons, there are a few differences,
mainly because a floret pipeline doesn't include any kind of frequency-based
word list similar to the list of in-vocabulary vector keys with default vectors.

- If your workflow iterates over the vector keys, you should use an external
  word list instead:

  ```diff
  - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
  + lexemes = [nlp.vocab[word] for word in external_word_list]
  ```

- `Vectors.most_similar` is not supported because there's no fixed list of
  vectors to compare your vectors to.
### Pipeline package version compatibility {#version-compat}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with an earlier version of spaCy
v3, you will see a warning telling you that the pipeline may be incompatible.
This doesn't necessarily have to be true, but we recommend running your
pipelines against your test suite or evaluation data to make sure there are no
unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.2.0,<3.3.0",
+ "spacy_version": ">=3.2.0,<3.4.0",
```
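A minimal sketch of what such a range requirement means, using a naive numeric-tuple comparison (real requirement parsing should use the `packaging` library, which also handles pre-releases and suffixes):

```python
def version_tuple(v):
    """Naive parse of a dotted release version like '3.3.0'."""
    return tuple(int(part) for part in v.split("."))

def satisfies(version, minimum, below):
    """True if minimum <= version < below, e.g. '>=3.2.0,<3.4.0'."""
    return version_tuple(minimum) <= version_tuple(version) < version_tuple(below)

# With the widened range, a v3.3 install satisfies the requirement:
print(satisfies("3.3.0", "3.2.0", "3.4.0"))  # True
print(satisfies("3.4.1", "3.2.0", "3.4.0"))  # False
```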
### Updating v3.2 configs

To update a config from spaCy v3.2 with the new v3.3 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).

To see the speed improvements for the
[`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
@@ -5,6 +5,7 @@ new: 2

menu:
  - ['Dependencies', 'dep']
  - ['Named Entities', 'ent']
  - ['Spans', 'span']
  - ['Jupyter Notebooks', 'jupyter']
  - ['Rendering HTML', 'html']
  - ['Web app usage', 'webapp']
@@ -192,7 +193,7 @@ displacy.serve(doc, style="span")

import DisplacySpanHtml from 'images/displacy-span.html'

<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />

The span visualizer lets you customize the following `options`:
@@ -62,6 +62,11 @@

"example": "Dies ist ein Satz.",
"has_examples": true
},
{
  "code": "dsb",
  "name": "Lower Sorbian",
  "has_examples": true
},
{
  "code": "el",
  "name": "Greek",
@@ -159,6 +164,11 @@

"name": "Croatian",
"has_examples": true
},
{
  "code": "hsb",
  "name": "Upper Sorbian",
  "has_examples": true
},
{
  "code": "hu",
  "name": "Hungarian",
@@ -11,7 +11,8 @@

{ "text": "spaCy 101", "url": "/usage/spacy-101" },
{ "text": "New in v3.0", "url": "/usage/v3" },
{ "text": "New in v3.1", "url": "/usage/v3-1" },
{ "text": "New in v3.2", "url": "/usage/v3-2" },
{ "text": "New in v3.3", "url": "/usage/v3-3" }
]
},
{
@@ -120,8 +120,8 @@ const AlertSpace = ({ nightly, legacy }) => {

}

const navAlert = (
  <Link to="/usage/v3-3" hidden>
    <strong>💥 Out now:</strong> spaCy v3.3
  </Link>
)