mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-25 00:34:20 +03:00
554df9ef20
* Rename all MDX file to `.mdx`
* Lock current node version (#11885)
* Apply Prettier (#11996)
* Minor website fixes (#11974) [ci skip]
* fix table
* Migrate to Next WEB-17 (#12005)
* Initial commit
* Run `npx create-next-app@13 next-blog`
* Install MDX packages
Following: 77b5f79a4d/packages/next-mdx/readme.md
* Add MDX to Next
* Allow Next to handle `.md` and `.mdx` files.
* Add VSCode extension recommendation
* Disabled TypeScript strict mode for now
* Add prettier
* Apply Prettier to all files
* Make sure to use correct Node version
* Add basic implementation for `MDXRemote`
* Add experimental Rust MDX parser
* Add `/public`
* Add SASS support
* Remove default pages and styling
* Convert to module
This allows to use `import/export` syntax
* Add import for custom components
* Add ability to load plugins
* Extract function
This will make the next commit easier to read
* Allow to handle directories for page creation
* Refactoring
* Allow to parse subfolders for pages
* Extract logic
* Redirect `index.mdx` to parent directory
* Disabled ESLint during builds
* Disabled typescript during build
* Remove Gatsby from `README.md`
* Rephrase Docker part of `README.md`
* Update project structure in `README.md`
* Move and rename plugins
* Update plugin for wrapping sections
* Add dependencies for plugin
* Use plugin
* Rename wrapper type
* Simplify unnessary adding of id to sections
The slugified section ids are useless, because they can not be referenced anywhere anyway. The navigation only works if the section has the same id as the heading.
* Add plugin for custom attributes on Markdown elements
* Add plugin to readd support for tables
* Add plugin to fix problem with wrapped images
For more details see this issue: https://github.com/mdx-js/mdx/issues/1798
* Add necessary meta data to pages
* Install necessary dependencies
* Remove outdated MDX handling
* Remove reliance on `InlineList`
* Use existing Remark components
* Remove unallowed heading
Before `h1` components where not overwritten and would never have worked and they aren't used anywhere either.
* Add missing components to MDX
* Add correct styling
* Fix broken list
* Fix broken CSS classes
* Implement layout
* Fix links
* Fix broken images
* Fix pattern image
* Fix heading attributes
* Rename heading attribute
`new` was causing some weird issue, so renaming it to `version`
* Update comment syntax in MDX
* Merge imports
* Fix markdown rendering inside components
* Add model pages
* Simplify anchors
* Fix default value for theme
* Add Universe index page
* Add Universe categories
* Add Universe projects
* Fix Next problem with copy
Next complains when the server renders something different then the client, therfor we move the differing logic to `useEffect`
* Fix improper component nesting
Next doesn't allow block elements inside a `<p>`
* Replace landing page MDX with page component
* Remove inlined iframe content
* Remove ability to inline HTML content in iFrames
* Remove MDX imports
* Fix problem with image inside link in MDX
* Escape character for MDX
* Fix unescaped characters in MDX
* Fix headings with logo
* Allow to export static HTML pages
* Add prebuild script
This command is automatically run by Next
* Replace `svg-loader` with `react-inlinesvg`
`svg-loader` is no longer maintained
* Fix ESLint `react-hooks/exhaustive-deps`
* Fix dropdowns
* Change code language from `cli` to `bash`
* Remove unnessary language `none`
* Fix invalid code language
`markdown_` with an underscore was used to basically turn of syntax highlighting, but using unknown languages know throws an error.
* Enable code blocks plugin
* Readd `InlineCode` component
MDX2 removed the `inlineCode` component
> The special component name `inlineCode` was removed, we recommend to use `pre` for the block version of code, and code for both the block and inline versions
Source: https://mdxjs.com/migrating/v2/#update-mdx-content
* Remove unused code
* Extract function to own file
* Fix code syntax highlighting
* Update syntax for code block meta data
* Remove unused prop
* Fix internal link recognition
There is a problem with regex between Node and browser, and since Next runs the component on both, this create an error.
`Prop `rel` did not match. Server: "null" Client: "noopener nofollow noreferrer"`
This simplifies the implementation and fixes the above error.
* Replace `react-helmet` with `next/head`
* Fix `className` problem for JSX component
* Fix broken bold markdown
* Convert file to `.mjs` to be used by Node process
* Add plugin to replace strings
* Fix custom table row styling
* Fix problem with `span` inside inline `code`
React doesn't allow a `span` inside an inline `code` element and throws an error in dev mode.
* Add `_document` to be able to customize `<html>` and `<body>`
* Add `lang="en"`
* Store Netlify settings in file
This way we don't need to update via Netlify UI, which can be tricky if changing build settings.
* Add sitemap
* Add Smartypants
* Add PWA support
* Add `manifest.webmanifest`
* Fix bug with anchor links after reloading
There was no need for the previous implementation, since the browser handles this nativly. Additional the manual scrolling into view was actually broken, because the heading would disappear behind the menu bar.
* Rename custom event
I was googeling for ages to find out what kind of event `inview` is, only to figure out it was a custom event with a name that sounds pretty much like a native one. 🫠
* Fix missing comment syntax highlighting
* Refactor Quickstart component
The previous implementation was hidding the irrelevant lines via data-props and dynamically generated CSS. This created problems with Next and was also hard to follow. CSS was used to do what React is supposed to handle.
The new implementation simplfy filters the list of children (React elements) via their props.
* Fix syntax highlighting for Training Quickstart
* Unify code rendering
* Improve error logging in Juniper
* Fix Juniper component
* Automatically generate "Read Next" link
* Add Plausible
* Use recent DocSearch component and adjust styling
* Fix images
* Turn of image optimization
> Image Optimization using Next.js' default loader is not compatible with `next export`.
We currently deploy to Netlify via `next export`
* Dont build pages starting with `_`
* Remove unused files
* Add Next plugin to Netlify
* Fix button layout
MDX automatically adds `p` tags around text on a new line and Prettier wants to put the text on a new line. Hacking with JSX string.
* Add 404 page
* Apply Prettier
* Update Prettier for `package.json`
Next sometimes wants to patch `package-lock.json`. The old Prettier setting indended with 4 spaces, but Next always indends with 2 spaces. Since `npm install` automatically uses the indendation from `package.json` for `package-lock.json` and to avoid the format switching back and forth, both files are now set to 2 spaces.
* Apply Next patch to `package-lock.json`
When starting the dev server Next would warn `warn - Found lockfile missing swc dependencies, patching...` and update the `package-lock.json`. These are the patched changes.
* fix link
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* small backslash fixes
* adjust to new style
Co-authored-by: Marcus Blättermann <marcus@essenmitsosse.de>
345 lines
13 KiB
Plaintext
345 lines
13 KiB
Plaintext
---
|
|
title: What's New in v2.3
|
|
teaser: New features, backwards incompatibilities and migration guide
|
|
menu:
|
|
- ['New Features', 'features']
|
|
- ['Backwards Incompatibilities', 'incompat']
|
|
- ['Migrating from v2.2', 'migrating']
|
|
---
|
|
|
|
## New Features {id="features",hidden="true"}
|
|
|
|
spaCy v2.3 features new pretrained models for five languages, word vectors for
|
|
all language models, and decreased model size and loading times for models with
|
|
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
|
|
and Romanian** and updated the training data and vectors for most languages.
|
|
Model packages with vectors are about **2×** smaller on disk and load
|
|
**2-4×** faster. For the full changelog, see the
|
|
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
|
|
For more details and a behind-the-scenes look at the new release,
|
|
[see our blog post](https://explosion.ai/blog/spacy-v2-3).
|
|
|
|
### Expanded model families with vectors {id="models"}
|
|
|
|
> #### Example
|
|
>
|
|
> ```bash
|
|
> python -m spacy download da_core_news_sm
|
|
> python -m spacy download ja_core_news_sm
|
|
> python -m spacy download pl_core_news_sm
|
|
> python -m spacy download ro_core_news_sm
|
|
> python -m spacy download zh_core_web_sm
|
|
> ```
|
|
|
|
With new model families for Chinese, Danish, Polish, Romanian and Chinese plus
|
|
`md` and `lg` models with word vectors for all languages, this release provides
|
|
a total of 46 model packages. For models trained using
|
|
[Universal Dependencies](https://universaldependencies.org) corpora, the
|
|
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
|
|
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
|
|
|
|
<Infobox>
|
|
|
|
**Models:** [Models directory](/models) **Benchmarks: **
|
|
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.3.0)
|
|
|
|
</Infobox>
|
|
|
|
### Chinese {id="chinese"}
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.lang.zh import Chinese
|
|
>
|
|
> # Load with "default" model provided by pkuseg
|
|
> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
|
|
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
|
|
>
|
|
> # Append words to user dict
|
|
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
|
|
> ```
|
|
|
|
This release adds support for
|
|
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
|
|
the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
|
|
Chinese tokenizer can be initialized with both `pkuseg` and custom models and
|
|
the `pkuseg` user dictionary is easy to customize. Note that
|
|
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
|
|
pre-compiled wheels for Python 3.8. See the
|
|
[usage documentation](/usage/models#chinese) for details on how to install it on
|
|
Python 3.8.
|
|
|
|
<Infobox>
|
|
|
|
**Models:** [Chinese models](/models/zh) **Usage: **
|
|
[Chinese tokenizer usage](/usage/models#chinese)
|
|
|
|
</Infobox>
|
|
|
|
### Japanese {id="japanese"}
|
|
|
|
The updated Japanese language class switches to
|
|
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
|
|
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
|
|
installing spaCy for Japanese, which is now possible with a single command:
|
|
`pip install spacy[ja]`.
|
|
|
|
<Infobox>
|
|
|
|
**Models:** [Japanese models](/models/ja) **Usage:**
|
|
[Japanese tokenizer usage](/usage/models#japanese)
|
|
|
|
</Infobox>
|
|
|
|
### Small CLI updates
|
|
|
|
- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
|
|
in a base model with `spacy debug-data lang train dev -b base_model`
|
|
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
|
|
`spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
|
|
without loading a model
|
|
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
|
|
the first iteration
|
|
|
|
## Backwards incompatibilities {id="incompat"}
|
|
|
|
<Infobox title="Important note on models" variant="warning">
|
|
|
|
If you've been training **your own models**, you'll need to **retrain** them
|
|
with the new version. Also don't forget to upgrade all models to the latest
|
|
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
|
|
with models for v2.3. To check if all of your models are up to date, you can run
|
|
the [`spacy validate`](/api/cli#validate) command.
|
|
|
|
</Infobox>
|
|
|
|
> #### Install with lookups data
|
|
>
|
|
> ```bash
|
|
> $ pip install spacy[lookups]
|
|
> ```
|
|
>
|
|
> You can also install
|
|
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
|
|
> directly.
|
|
|
|
- If you're training new models, you'll want to install the package
|
|
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
|
|
now includes both the lemmatization tables (as in v2.2) and the normalization
|
|
tables (new in v2.3). If you're using pretrained models, **nothing changes**,
|
|
because the relevant tables are included in the model packages.
|
|
- Due to the updated Universal Dependencies training data, the fine-grained
|
|
part-of-speech tags will change for many provided language models. The
|
|
coarse-grained part-of-speech tagset remains the same, but the mapping from
|
|
particular fine-grained to coarse-grained tags may show minor differences.
|
|
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
|
|
tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
|
|
for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
|
|
increases the accuracy of the models by improving the alignment between
|
|
spaCy's tokenization and Universal Dependencies multi-word tokens used for
|
|
contractions.
|
|
|
|
### Migrating from spaCy 2.2 {id="migrating"}
|
|
|
|
#### Tokenizer settings
|
|
|
|
In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match`
|
|
that gave prefixes and suffixes priority over `token_match`, which caused
|
|
problems for many custom tokenizer configurations. This has been reverted in
|
|
v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
|
|
and earlier versions.
|
|
|
|
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
|
|
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
|
|
comma at the end of a URL) before applying the match. See the full
|
|
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
|
|
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
|
|
debugging your tokenizer configuration.
|
|
|
|
#### Warnings configuration
|
|
|
|
spaCy's custom warnings have been replaced with native Python
|
|
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
|
|
setting `SPACY_WARNING_IGNORE`, use the
|
|
[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
|
|
to manage warnings.
|
|
|
|
```diff
|
|
import spacy
|
|
+ import warnings
|
|
|
|
- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
|
|
+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning)
|
|
```
|
|
|
|
#### Normalization tables
|
|
|
|
The normalization tables have moved from the language data in
|
|
[`spacy/lang`](https://github.com/explosion/spacy/tree/v2.x/spacy/lang) to the
|
|
package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
|
|
If you're adding data for a new language, the normalization table should be
|
|
added to `spacy-lookups-data`. See
|
|
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
|
|
|
|
#### No preloaded vocab for models with vectors
|
|
|
|
To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
|
|
loaded on initialization for models with vectors. As you process texts, the
|
|
lexemes will be added to the vocab automatically, just as in small models
|
|
without vectors.
|
|
|
|
To see the number of unique vectors and number of words with vectors, see
|
|
`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000` unique
|
|
vectors and `684830` words with vectors:
|
|
|
|
```python
|
|
{
|
|
'width': 300,
|
|
'vectors': 20000,
|
|
'keys': 684830,
|
|
'name': 'en_core_web_md.vectors'
|
|
}
|
|
```
|
|
|
|
If required, for instance if you are working directly with word vectors rather
|
|
than processing texts, you can load all lexemes for words with vectors at once:
|
|
|
|
```python
|
|
for orth in nlp.vocab.vectors:
|
|
_ = nlp.vocab[orth]
|
|
```
|
|
|
|
If your workflow previously iterated over `nlp.vocab`, a similar alternative is
|
|
to iterate over words with vectors instead:
|
|
|
|
```diff
|
|
- lexemes = [w for w in nlp.vocab]
|
|
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
|
|
```
|
|
|
|
Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
|
|
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
|
|
provided lexemes but only 685K words with vectors. The vectors have been updated
|
|
for most languages in v2.2, but the English models contain the same vectors for
|
|
both v2.2 and v2.3.
|
|
|
|
#### Lexeme.is_oov and Token.is_oov
|
|
|
|
<Infobox title="Important note" variant="warning">
|
|
|
|
Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
|
|
fixed in the next patch release v2.3.1.
|
|
|
|
</Infobox>
|
|
|
|
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
|
|
have a word vector. This is equivalent to `token.orth not in nlp.vocab.vectors`.
|
|
|
|
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
|
|
probability and cluster features. The probability and cluster features are no
|
|
longer included in the provided medium and large models (see the next section).
|
|
|
|
#### Probability and cluster features
|
|
|
|
> #### Load and save extra prob lookups table
|
|
>
|
|
> ```python
|
|
> from spacy.lang.en import English
|
|
> nlp = English()
|
|
> doc = nlp("the")
|
|
> print(doc[0].prob) # lazily loads extra prob table
|
|
> nlp.to_disk("/path/to/model") # includes prob table
|
|
> ```
|
|
|
|
The `Token.prob` and `Token.cluster` features, which are no longer used by the
|
|
core pipeline components as of spaCy v2, are no longer provided in the
|
|
pretrained models to reduce the model size. To keep these features available for
|
|
users relying on them, the `prob` and `cluster` features for the most frequent
|
|
1M tokens have been moved to
|
|
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
|
|
`extra` features for the relevant languages (English, German, Greek and
|
|
Spanish).
|
|
|
|
The extra tables are loaded lazily, so if you have `spacy-lookups-data`
|
|
installed and your code accesses `Token.prob`, the full table is loaded into the
|
|
model vocab, which will take a few seconds on initial loading. When you save
|
|
this model after loading the `prob` table, the full `prob` table will be saved
|
|
as part of the model vocab.
|
|
|
|
To load the probability table into a provided model, first make sure you have
|
|
`spacy-lookups-data` installed. To load the table, remove the empty provided
|
|
`lexeme_prob` table and then access `Lexeme.prob` for any word to load the table
|
|
from `spacy-lookups-data`:
|
|
|
|
```diff
|
|
+ # prerequisite: pip install spacy-lookups-data
|
|
import spacy
|
|
|
|
nlp = spacy.load("en_core_web_md")
|
|
|
|
# remove the empty placeholder prob table
|
|
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
|
|
+ nlp.vocab.lookups_extra.remove_table("lexeme_prob")
|
|
|
|
# access any `.prob` to load the full table into the model
|
|
assert nlp.vocab["a"].prob == -3.9297883511
|
|
|
|
# if desired, save this model with the probability table included
|
|
nlp.to_disk("/path/to/model")
|
|
```
|
|
|
|
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
|
|
of a new model, add the data to
|
|
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
|
|
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
|
|
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
|
|
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
|
|
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
|
|
currently only used to provide a custom `oov_prob`. See examples in the
|
|
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
|
|
in `spacy-lookups-data`.
|
|
|
|
#### Initializing new models without extra lookups tables
|
|
|
|
When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
|
|
the `prob` table from `spacy-lookups-data` may be loaded as part of the
|
|
initialization. If you'd like to omit this extra data as in spaCy's provided
|
|
v2.3 models, use the new flag `--omit-extra-lookups`.
|
|
|
|
#### Tag maps in provided models vs. blank models
|
|
|
|
The tag maps in the provided models may differ from the tag maps in the spaCy
|
|
library. You can access the tag map in a loaded model under
|
|
`nlp.vocab.morphology.tag_map`.
|
|
|
|
The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
|
|
initialized. If you want to provide an alternate tag map, update
|
|
`nlp.vocab.morphology.tag_map` after initializing the model or if you're using
|
|
the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
|
|
provide in the tag map as a JSON dict.
|
|
|
|
If you want to export a tag map from a provided model for use with the train
|
|
CLI, you can save it as a JSON dict. To only use string keys as required by JSON
|
|
and to make it easier to read and edit, any internal integer IDs need to be
|
|
converted back to strings:
|
|
|
|
```python
|
|
import spacy
|
|
import srsly
|
|
|
|
nlp = spacy.load("en_core_web_sm")
|
|
tag_map = {}
|
|
|
|
# convert any integer IDs to strings for JSON
|
|
for tag, morph in nlp.vocab.morphology.tag_map.items():
|
|
tag_map[tag] = {}
|
|
for feat, val in morph.items():
|
|
feat = nlp.vocab.strings.as_string(feat)
|
|
if not isinstance(val, bool):
|
|
val = nlp.vocab.strings.as_string(val)
|
|
tag_map[tag][feat] = val
|
|
|
|
srsly.write_json("tag_map.json", tag_map)
|
|
```
|