Update v2-2.md [ci skip]

Ines Montani 2019-09-27 16:35:01 +02:00
parent aad66d9bb9
commit 685e4b2554

@@ -336,31 +336,39 @@ check if all of your models are up to date, you can run the
 </Infobox>
-- The Dutch models have been trained on a new NER corpus (custom labelled UD
-  instead of WikiNER), so their predictions may be very different compared to
-  the previous version. The results should be significantly better and more
-  generalizable, though.
+- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
+  labelled UD instead of WikiNER), so their predictions may be very different
+  compared to the previous version. The results should be significantly better
+  and more generalizable, though.
-- The `spacy download` command does **not** set the `--no-deps` pip argument
-  anymore by default, meaning that model package dependencies (if available)
-  will now be also downloaded and installed. If spaCy (which is also a model
-  dependency) is not installed in the current environment, e.g. if a user has
-  built from source, `--no-deps` is added back automatically to prevent spaCy
-  from being downloaded and installed again from pip.
+- The [`spacy download`](/api/cli#download) command does **not** set the
+  `--no-deps` pip argument anymore by default, meaning that model package
+  dependencies (if available) will now be also downloaded and installed. If
+  spaCy (which is also a model dependency) is not installed in the current
+  environment, e.g. if a user has built from source, `--no-deps` is added back
+  automatically to prevent spaCy from being downloaded and installed again from
+  pip.
-- The built-in `biluo_tags_from_offsets` converter is now stricter and will
-  raise an error if entities are overlapping (instead of silently skipping
-  them). If your data contains invalid entity annotations, make sure to clean it
-  and resolve conflicts. You can now also use the new `debug-data` command to
-  find problems in your data.
+- The built-in
+  [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
+  is now stricter and will raise an error if entities are overlapping (instead
+  of silently skipping them). If your data contains invalid entity annotations,
+  make sure to clean it and resolve conflicts. You can now also use the new
+  `debug-data` command to find problems in your data.
 - Pipeline components can now overwrite IOB tags of tokens that are not yet part
   of an entity. Once a token has an `ent_iob` value set, it won't be reset to an
   "unset" state and will always have at least `O` assigned. `list(doc.ents)` now
   actually keeps the annotations on the token level consistent, instead of
   resetting `O` to an empty string.
-- The default punctuation in the `sentencizer` has been extended and now
-  includes more characters common in various languages. This also means that the
-  results it produces may change, depending on your text. If you want the
-  previous behaviour with limited characters, set `punct_chars=[".", "!", "?"]`
-  on initialization.
+- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
+  extended and now includes more characters common in various languages. This
+  also means that the results it produces may change, depending on your text. If
+  you want the previous behaviour with limited characters, set
+  `punct_chars=[".", "!", "?"]` on initialization.
+- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
+  and it's now 10&times; faster. The rewrite also resolved a few subtle bugs
+  with very large terminology lists. So if you were matching large lists, you
+  may see slightly different results however, the results should now be fully
+  correct. See [this PR](https://github.com/explosion/spaCy/pulls/4309) for more
+  details.
 - Lemmatization tables (rules, exceptions, index and lookups) are now part of
   the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
   pipeline components, vocab) will now include additional data, and models