Update v2-2.md [ci skip]

Ines Montani 2019-09-27 16:35:01 +02:00
parent aad66d9bb9
commit 685e4b2554

@@ -336,31 +336,39 @@ check if all of your models are up to date, you can run the
</Infobox>
- The Dutch models have been trained on a new NER corpus (custom labelled UD
instead of WikiNER), so their predictions may be very different compared to
the previous version. The results should be significantly better and more
generalizable, though.
- The `spacy download` command does **not** set the `--no-deps` pip argument
anymore by default, meaning that model package dependencies (if available)
will now also be downloaded and installed. If spaCy (which is also a model
dependency) is not installed in the current environment, e.g. if a user has
built from source, `--no-deps` is added back automatically to prevent spaCy
from being downloaded and installed again from pip.
- The built-in `biluo_tags_from_offsets` converter is now stricter and will
raise an error if entities are overlapping (instead of silently skipping
them). If your data contains invalid entity annotations, make sure to clean it
and resolve conflicts. You can now also use the new `debug-data` command to
find problems in your data.
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
labelled UD instead of WikiNER), so its predictions may be very different
compared to the previous version. The results should be significantly better
and more generalizable, though.
- The [`spacy download`](/api/cli#download) command does **not** set the
`--no-deps` pip argument anymore by default, meaning that model package
dependencies (if available) will now also be downloaded and installed. If
spaCy (which is also a model dependency) is not installed in the current
environment, e.g. if a user has built from source, `--no-deps` is added back
automatically to prevent spaCy from being downloaded and installed again from
pip.
- The built-in
[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
is now stricter and will raise an error if entities are overlapping (instead
of silently skipping them). If your data contains invalid entity annotations,
make sure to clean it and resolve conflicts. You can now also use the new
`debug-data` command to find problems in your data.
- Pipeline components can now overwrite IOB tags of tokens that are not yet part
of an entity. Once a token has an `ent_iob` value set, it won't be reset to an
"unset" state and will always have at least `O` assigned. `list(doc.ents)` now
actually keeps the annotations on the token level consistent, instead of
resetting `O` to an empty string.
- The default punctuation in the `sentencizer` has been extended and now
includes more characters common in various languages. This also means that the
results it produces may change, depending on your text. If you want the
previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
on initialization.
- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
extended and now includes more characters common in various languages. This
also means that the results it produces may change, depending on your text. If
you want the previous behaviour with limited characters, set
`punct_chars=[".", "!", "?"]` on initialization.
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
and is now 10&times; faster. The rewrite also resolved a few subtle bugs
with very large terminology lists. So if you were matching large lists, you
may see slightly different results. However, the results should now be fully
correct. See [this PR](https://github.com/explosion/spaCy/pulls/4309) for more
details.
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
pipeline components, vocab) will now include additional data, and models