Add backwards incompatibility [ci skip]

2025-11-09 04:17:53 +03:00 · 2019-09-18 21:21:48 +02:00 · 2019-09-18 21:21:48 +02:00 · f873548f6c
commit f873548f6c
parent 6ebdc5f7d2
1 changed files with 30 additions and 1 deletions
--- a/website/docs/usage/v2-2.md
+++ b/website/docs/usage/v2-2.md
@@ -326,4 +326,33 @@ check if all of your models are up to date, you can run the
 </Infobox>
-<!-- TODO: copy from release notes once they're ready -->
+- The Dutch models have been trained on a new NER corpus (custom labelled UD
  instead of WikiNER), so their predictions may be very different compared to
  the previous version. The results should be significantly better and more
  generalizable, though.
 - The `spacy download` command does **not** set the `--no-deps` pip argument
  anymore by default, meaning that model package dependencies (if available)
  will now also be downloaded and installed. If spaCy (which is also a model
  dependency) is not installed in the current environment, e.g. if a user has
  built from source, `--no-deps` is added back automatically to prevent spaCy
  from being downloaded and installed again from pip.
 - The built-in `biluo_tags_from_offsets` converter is now stricter and will
  raise an error if entities are overlapping (instead of silently skipping
  them). If your data contains invalid entity annotations, make sure to clean it
  and resolve conflicts. You can now also use the new `debug-data` command to
  find problems in your data.
 - The default punctuation in the `sentencizer` has been extended and now
  includes more characters common in various languages. This also means that the
  results it produces may change, depending on your text. If you want the
  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
  on initialization.
 - Lemmatization tables (rules, exceptions, index and lookups) are now part of
  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
  pipeline components, vocab) will now include additional data, and models
  written to disk will include additional files.
 - The `Serbian` language class (introduced in v2.1.8) incorrectly used the
  language code `rs` instead of `sr`. This has been fixed, so `Serbian` is now
  available via `spacy.lang.sr`.
 - The `"sources"` in the `meta.json` have changed from a list of strings to a
  list of dicts. This is mostly an internal detail, but if your code used
  `nlp.meta["sources"]`, you might have to update it.
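
Since `biluo_tags_from_offsets` now raises an error on overlapping entities, it can help to scan the annotations before conversion. The following is a minimal sketch in plain Python, not part of spaCy: the helper name `find_overlapping_entities` is hypothetical, and it assumes entities are `(start_char, end_char, label)` tuples with an exclusive end offset.

```python
def find_overlapping_entities(entities):
    """Return pairs of entity spans that share at least one character.

    Hypothetical helper, not a spaCy API. Assumes each entity is a
    (start_char, end_char, label) tuple with an exclusive end offset.
    """
    conflicts = []
    # Sort by start offset so overlapping spans end up next to each other.
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for i, current in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] >= current[1]:
                # Spans are sorted by start, so no later span can overlap.
                break
            conflicts.append((current, other))
    return conflicts


entities = [(0, 5, "ORG"), (3, 8, "PERSON"), (10, 12, "GPE")]
for first, second in find_overlapping_entities(entities):
    print("Overlap:", first, second)
```

Spans that only touch, such as `(0, 5)` and `(5, 8)`, are not reported, because the end offset is treated as exclusive.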
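
For code that reads `nlp.meta["sources"]` and has to work with both the old and the new format, a small compatibility helper is one option. This is only a sketch: the function name `source_names` is hypothetical, and the `"name"` key is an assumption about the shape of the new dict entries, so check it against the `meta.json` of your model.

```python
def source_names(meta):
    """Return source names from a meta dict, for either "sources" format.

    Hypothetical helper. Assumes that dict entries store the display name
    under a "name" key; adjust this to match your model's meta.json.
    """
    names = []
    for source in meta.get("sources") or []:
        if isinstance(source, dict):
            # New format: a list of dicts.
            names.append(source.get("name", ""))
        else:
            # Old format: a list of plain strings.
            names.append(str(source))
    return names


old_meta = {"sources": ["Example Corpus"]}
new_meta = {"sources": [{"name": "Example Corpus", "url": "https://example.com"}]}
print(source_names(old_meta))
print(source_names(new_meta))
```

With a loaded pipeline, the helper would be called as `source_names(nlp.meta)`.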