mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 01:46:28 +03:00
Add backwards incompatibility [ci skip]
parent
6ebdc5f7d2
commit
f873548f6c
@@ -326,4 +326,33 @@ check if all of your models are up to date, you can run the

</Infobox>

<!-- TODO: copy from release notes once they're ready -->
- The Dutch models have been trained on a new NER corpus (custom labelled UD
  instead of WikiNER), so their predictions may be very different from those of
  the previous version. The results should be significantly better and more
  generalizable, though.
- The `spacy download` command **no longer** sets the `--no-deps` pip argument
  by default, meaning that model package dependencies (if available) will now
  also be downloaded and installed. If spaCy (which is also a model dependency)
  is not installed in the current environment, e.g. if a user has built from
  source, `--no-deps` is added back automatically to prevent spaCy from being
  downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will
  raise an error if entities are overlapping (instead of silently skipping
  them). If your data contains invalid entity annotations, make sure to clean it
  and resolve conflicts. You can now also use the new `debug-data` command to
  find problems in your data.
- The default punctuation in the `sentencizer` has been extended and now
  includes more characters common in various languages. This also means that the
  results it produces may change, depending on your text. If you want the
  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
  on initialization.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
  pipeline components, vocab) will now include additional data, and models
  written to disk will include additional files.
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
  language code `rs` instead of `sr`. This has been fixed, so `Serbian` is now
  available via `spacy.lang.sr`.
- The `"sources"` in the `meta.json` have changed from a list of strings to a
  list of dicts. This is mostly an internal detail, but if your code used
  `nlp.meta["sources"]`, you might have to update it.
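
Because `biluo_tags_from_offsets` now raises an error on overlapping entities, it can help to scan your annotations for conflicts before converting them. The following is a minimal sketch in plain Python and not part of spaCy: the helper name `find_overlapping_entities` is hypothetical, and it assumes entities are given as `(start_char, end_char, label)` tuples with an exclusive end offset.

```python
from typing import List, Tuple

# Assumed annotation format: (start_char, end_char, label), end exclusive.
Span = Tuple[int, int, str]


def find_overlapping_entities(entities: List[Span]) -> List[Tuple[Span, Span]]:
    """Return all pairs of entity spans whose character ranges overlap.

    Hypothetical helper for cleaning data; it does not call spaCy.
    """
    conflicts = []
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Spans are sorted by start, so no later span can overlap.
                break
            conflicts.append((current, other))
    return conflicts


text = "I like London and Berlin."
# The second span overlaps both the first and the third one.
entities = [(7, 13, "LOC"), (7, 20, "LOC"), (18, 24, "LOC")]
conflicts = find_overlapping_entities(entities)
for first, second in conflicts:
    print("Overlap:", text[first[0]:first[1]], "<->", text[second[0]:second[1]])
```

Any pair reported here would need to be resolved (for example by keeping only one of the two spans) before the annotations are passed to the converter.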
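
If your code reads `nlp.meta["sources"]` and has to work with models from before and after this change, one option is to normalize both shapes to a list of names. This is only a sketch: the helper `source_names` is hypothetical, the example values are illustrative, and the `"name"` key is an assumption about the new dict format that you should check against your model's `meta.json`.

```python
from typing import Any, Dict, List, Optional, Union

Source = Union[str, Dict[str, Any]]


def source_names(sources: Optional[List[Source]]) -> List[str]:
    """Return source names for both the old and the new "sources" format.

    Old format: a list of strings.
    New format: a list of dicts (assumed here to carry a "name" key).
    """
    names = []
    for source in sources or []:
        if isinstance(source, dict):
            # Assumption: the dict stores the display name under "name".
            names.append(str(source.get("name", "")))
        else:
            names.append(str(source))
    return names


# Illustrative values only, not copied from a real meta.json.
old_style = ["Example Corpus"]
new_style = [{"name": "Example Corpus", "license": "Example License"}]

print(source_names(old_style))
print(source_names(new_style))
```

With a loaded pipeline, the call would look like `source_names(nlp.meta.get("sources"))`, which also tolerates models that define no sources at all.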