Add backwards incompatibility [ci skip]

2025-11-09 04:17:53 +03:00 · 2019-09-18 21:21:48 +02:00 · f873548f6c
commit f873548f6c
parent 6ebdc5f7d2
1 changed file with 30 additions and 1 deletion
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -326,4 +326,33 @@ check if all of your models are up to date, you can run the

 </Infobox>

-<!-- TODO: copy from release notes once they're ready -->
+- The Dutch models have been trained on a new NER corpus (custom labelled UD
+  instead of WikiNER), so their predictions may be very different compared to
+  the previous version. The results should be significantly better and more
+  generalizable, though.
+- The `spacy download` command does **not** set the `--no-deps` pip argument
+  anymore by default, meaning that model package dependencies (if available)
+  will now also be downloaded and installed. If spaCy (which is also a model
+  dependency) is not installed in the current environment, e.g. if a user has
+  built from source, `--no-deps` is added back automatically to prevent spaCy
+  from being downloaded and installed again from pip.
+- The built-in `biluo_tags_from_offsets` converter is now stricter and will
+  raise an error if entities are overlapping (instead of silently skipping
+  them). If your data contains invalid entity annotations, make sure to clean it
+  and resolve conflicts. You can now also use the new `debug-data` command to
+  find problems in your data.
+- The default punctuation in the `sentencizer` has been extended and now
+  includes more characters common in various languages. This also means that the
+  results it produces may change, depending on your text. If you want the
+  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
+  on initialization.
+- Lemmatization tables (rules, exceptions, index and lookups) are now part of
+  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
+  pipeline components, vocab) will now include additional data, and models
+  written to disk will include additional files.
+- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
+  language code `rs` instead of `sr`. This has been fixed, so `Serbian` is now
+  available via `spacy.lang.sr`.
+- The `"sources"` in the `meta.json` have changed from a list of strings to a
+  list of dicts. This is mostly an internal detail, but if your code uses
+  `nlp.meta["sources"]`, you might have to update it.
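The stricter `biluo_tags_from_offsets` behaviour described above means overlapping entity annotations now need to be found and resolved before conversion. The following is a minimal, spaCy-independent sketch of such a check, not part of this commit. It assumes entities are given as `(start_char, end_char, label)` tuples with an exclusive end offset; the helper name `find_overlapping_entities` and the sample data are hypothetical.

```python
from typing import List, Tuple

# Assumed annotation shape: (start_char, end_char, label), end exclusive.
Entity = Tuple[int, int, str]


def find_overlapping_entities(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return every pair of entity spans whose character ranges overlap."""
    conflicts = []
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Spans are sorted by start, so no later span can overlap.
                break
            conflicts.append((current, other))
    return conflicts


# Illustrative data only: the first two spans share characters 3-4.
entities = [(0, 5, "ORG"), (3, 9, "PERSON"), (10, 15, "GPE")]
conflicts = find_overlapping_entities(entities)
for first, second in conflicts:
    print("Overlap:", first, second)
```

Running a check like this over training data before conversion gives a list of conflicts to fix by hand, which may be easier to act on than a single raised error.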
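For the `"sources"` change noted above, code that reads `nlp.meta["sources"]` may need to handle both the old list of strings and the new list of dicts. The sketch below is one hedged way to do that, not part of this commit. The helper `source_names` is hypothetical, and the `"name"` key is an assumption, since the entry above does not specify which keys the new dicts contain.

```python
from typing import Any, Dict, List


def source_names(meta: Dict[str, Any]) -> List[str]:
    """Return source names from either the old or the new "sources" format."""
    names = []
    for source in meta.get("sources") or []:
        if isinstance(source, dict):
            # Assumption: the new dict entries carry a "name" key.
            names.append(str(source.get("name", "")))
        else:
            # Old format: each entry is a plain string.
            names.append(str(source))
    return names


# Illustrative meta dicts only, not copied from a real model package.
old_meta = {"sources": ["Corpus A", "Corpus B"]}
new_meta = {"sources": [{"name": "Corpus A"}, {"name": "Corpus B"}]}

print(source_names(old_meta))
print(source_names(new_meta))
```

With a loaded pipeline, the same helper would be called as `source_names(nlp.meta)`.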