From f873548f6c26a91715beaf91d04831e29e60df2f Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Wed, 18 Sep 2019 21:21:48 +0200
Subject: [PATCH] Add backwards incompatibility [ci skip]

---
 website/docs/usage/v2-2.md | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md
index 376a9ae10..6c2d3c158 100644
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -326,4 +326,33 @@ check if all of your models are up to date, you can run the
-
+- The Dutch models have been trained on a new NER corpus (custom labelled UD
+  instead of WikiNER), so their predictions may be very different compared to
+  the previous version. The results should be significantly better and more
+  generalizable, though.
+- The `spacy download` command does **not** set the `--no-deps` pip argument
+  anymore by default, meaning that model package dependencies (if available)
+  will now be also downloaded and installed. If spaCy (which is also a model
+  dependency) is not installed in the current environment, e.g. if a user has
+  built from source, `--no-deps` is added back automatically to prevent spaCy
+  from being downloaded and installed again from pip.
+- The built-in `biluo_tags_from_offsets` converter is now stricter and will
+  raise an error if entities are overlapping (instead of silently skipping
+  them). If your data contains invalid entity annotations, make sure to clean it
+  and resolve conflicts. You can now also use the new `debug-data` command to
+  find problems in your data.
+- The default punctuation in the `sentencizer` has been extended and now
+  includes more characters common in various languages. This also means that the
+  results it produces may change, depending on your text. If you want the
+  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
+  on initialization.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
+- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
+  language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
+  now available via `spacy.lang.sr`.
+- The `"sources"` in the `meta.json` have changed from a list of strings to a
+  list of dicts. This is mostly internals, but if your code used
+  `nlp.meta["sources"]`, you might have to update it.
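
Since `biluo_tags_from_offsets` now raises an error on overlapping entities, it can help to check annotations before converting them. The following is a minimal sketch, not part of spaCy: the helper `find_overlapping_entities` is a hypothetical name, and it assumes annotations are `(start_char, end_char, label)` tuples with end-exclusive character offsets.

```python
from typing import List, Tuple

# Assumed annotation shape: (start_char, end_char, label), end-exclusive.
Entity = Tuple[int, int, str]


def find_overlapping_entities(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return every pair of annotations whose character ranges overlap."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    conflicts = []
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Sorted by start offset, so no later span can overlap either.
                break
            conflicts.append((current, other))
    return conflicts


# Illustrative data only: the first two spans share characters 3 to 5.
entities = [(0, 5, "ORG"), (3, 9, "PERSON"), (10, 14, "GPE")]
conflicts = find_overlapping_entities(entities)
```

An empty result means the offsets contain no overlaps; any reported pair needs to be resolved by hand before the data is passed to the converter.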
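
Code that reads `nlp.meta["sources"]` may need to handle both the old list of strings and the new list of dicts. Below is a hedged sketch of one way to do that. The patch does not say which keys the dicts contain, so the `"name"` key is an assumption, `source_names` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Any, Dict, List


def source_names(meta: Dict[str, Any]) -> List[str]:
    """Return source names from either the old or the new "sources" shape."""
    names = []
    for source in meta.get("sources") or []:
        if isinstance(source, dict):
            # New shape: list of dicts. The "name" key is an assumption.
            names.append(str(source.get("name", "")))
        else:
            # Old shape: list of plain strings.
            names.append(str(source))
    return names


# Hypothetical meta dicts standing in for nlp.meta before and after the change.
old_meta = {"sources": ["OntoNotes 5", "Common Crawl"]}
new_meta = {"sources": [{"name": "OntoNotes 5"}, {"name": "Common Crawl"}]}
old_names = source_names(old_meta)
new_names = source_names(new_meta)
```

Both calls yield the same list of names, so downstream code does not need to know which model version produced the meta.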