mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 17:06:29 +03:00
554df9ef20
---
title: Models & Languages
next: usage/facts-figures
menu:
  - ['Quickstart', 'quickstart']
  - ['Language Support', 'languages']
  - ['Installation & Usage', 'download']
  - ['Production Use', 'production']
---

spaCy's trained pipelines can be installed as **Python packages**. This means
that they're a component of your application, just like any other module.
They're versioned and can be defined as a dependency in your `requirements.txt`.
Trained pipelines can be installed from a download URL or a local directory,
manually or via [pip](https://pypi.python.org/pypi/pip). Their data can be
located anywhere on your file system.

> #### Important note
>
> If you're upgrading to spaCy v3.x, you need to **download the new pipeline
> packages**. If you've trained your own pipelines, you need to **retrain** them
> after updating spaCy.

## Quickstart {hidden="true"}

<QuickstartModels
  title="Quickstart"
  id="quickstart"
  description="Install a default trained pipeline package, get the code to load it from within spaCy and an example to test it. For more options, see the section on available packages below."
/>

### Usage note

> If lemmatization rules are available for your language, make sure to install
> spaCy with the `lookups` option, or install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> separately in the same environment:
>
> ```bash
> $ pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
> ```

If a trained pipeline is available for a language, you can download it using the
[`spacy download`](/api/cli#download) command as shown above. To use languages
that don't yet come with a trained pipeline, you have to import them directly,
or use [`spacy.blank`](/api/top-level#spacy.blank):

```python
import spacy
from spacy.lang.yo import Yoruba

nlp = Yoruba()  # use directly
nlp = spacy.blank("yo")  # blank instance
```

A blank pipeline is typically just a tokenizer. You might want to create a blank
pipeline when you only need a tokenizer, when you want to add more components
from scratch, or for testing purposes. Initializing the language object directly
yields the same result as generating it using `spacy.blank()`. In both cases the
default configuration for the chosen language is loaded, and no pretrained
components will be available.

## Language support {id="languages"}

spaCy currently provides support for the following languages. You can help by
improving the existing [language data](/usage/linguistic-features#language-data)
and extending the tokenization patterns.
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
contribute to development. Also see the
[training documentation](/usage/training) for how to train your own pipelines on
your data.

<Languages />

### Multi-language support {id="multi-language",version="2"}

> ```python
> import spacy
>
> # Standard import
> from spacy.lang.xx import MultiLanguage
> nlp = MultiLanguage()
>
> # With lazy-loading
> nlp = spacy.blank("xx")
> ```

spaCy also supports pipelines trained on more than one language. This is
especially useful for named entity recognition. The language ID used for
multi-language or language-neutral pipelines is `xx`. The language class, a
generic subclass containing only the base language data, can be found in
[`lang/xx`](%%GITHUB_SPACY/spacy/lang/xx).

To train a pipeline using the neutral multi-language class, you can set
`lang = "xx"` in your [training config](/usage/training#config). You can also
import the `MultiLanguage` class directly, or call
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.

### Chinese language support {id="chinese",version="2.3"}

The Chinese language class supports three word segmentation options, `char`,
`jieba` and `pkuseg`.

> #### Manual setup
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Character segmentation (default)
> nlp = Chinese()
> # Jieba
> cfg = {"segmenter": "jieba"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> # PKUSeg with "mixed" model provided by pkuseg
> cfg = {"segmenter": "pkuseg"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> nlp.tokenizer.initialize(pkuseg_model="mixed")
> ```

```ini {title="config.cfg"}
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"
```

| Segmenter | Description |
| --------- | ----------- |
| `char`    | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
| `jieba`   | **Jieba:** To use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, set the tokenizer option `segmenter` to `"jieba"`. |
| `pkuseg`  | **PKUSeg:** As of spaCy v2.3.0, [PKUSeg](https://github.com/explosion/spacy-pkuseg) is supported to provide better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting the tokenizer option `segmenter` to `"pkuseg"`. |

<Infobox title="Changed in v3.0" variant="warning">

In v3.0, the default word segmenter has switched from Jieba to character
segmentation. Because the `pkuseg` segmenter depends on a model that can be
loaded from a file, the model is loaded on
[initialization](/usage/training#config-lifecycle) (typically before training).
This ensures that your packaged Chinese model doesn't depend on a local path at
runtime.

</Infobox>

<Accordion title="Details on spaCy's Chinese API">

The `initialize` method for the Chinese tokenizer class supports the following
config settings for loading `pkuseg` models:

| Name               | Description |
| ------------------ | ----------- |
| `pkuseg_model`     | Name of a model provided by `spacy-pkuseg` or the path to a local model directory. ~~str~~ |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`, the default provided dictionary. ~~str~~ |

The initialization settings are typically provided in the
[training config](/usage/training#config). The data is loaded before training
and serialized with the model. This allows you to load the data from a local
path and save out your pipeline and config, without requiring the same local
path at runtime. See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) for more background on
this.

```ini {title="config.cfg"}
[initialize]

[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
```

You can also initialize the tokenizer for a blank language class by calling its
`initialize` method:

```python {title="Examples"}
from spacy.lang.zh import Chinese

# Initialize the pkuseg tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Load spaCy's OntoNotes model
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")

# Load pkuseg's "news" model
nlp.tokenizer.initialize(pkuseg_model="news")

# Load local model
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Override the user dictionary
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes", pkuseg_user_dict="/path/to/user_dict")
```

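Because `pkuseg_user_dict` expects a file with one word per line, such a file
can be prepared with the standard library alone. The following is a minimal
sketch, not part of spaCy: the helper name `write_user_dict` and the file name
`user_dict.txt` are hypothetical, and writing the file as UTF-8 is an assumption
that this guide does not spell out.

```python
import tempfile
from pathlib import Path


def write_user_dict(words, path):
    """Write one word per line (hypothetical helper, not part of spaCy)."""
    # Drop empty entries and duplicates while preserving the original order.
    unique_words = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    # UTF-8 is an assumption; the guide only says "one word per line".
    Path(path).write_text("\n".join(unique_words) + "\n", encoding="utf8")
    return unique_words


with tempfile.TemporaryDirectory() as tmp_dir:
    dict_path = Path(tmp_dir) / "user_dict.txt"
    written = write_user_dict(["中国", "ABC", "中国", " "], dict_path)
    lines = dict_path.read_text(encoding="utf8").splitlines()
    print(lines)
```

The resulting path could then be passed as `pkuseg_user_dict` when calling
`nlp.tokenizer.initialize`.
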
You can also modify the user dictionary on-the-fly:

```python
# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])

# Remove all words from user dict and replace with new words
nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)

# Remove all words from user dict
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
```

</Accordion>

<Accordion title="Details on trained and custom Chinese pipelines" spaced>

The [Chinese pipelines](/models/zh) provided by spaCy include a custom `pkuseg`
model trained only on
[Chinese OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), since the
models provided by `pkuseg` include data restricted to research use. For
research use, `pkuseg` provides models for several different domains (`"mixed"`
(equivalent to `"default"` from `pkuseg` packages), `"news"`, `"web"`,
`"medicine"`, `"tourism"`). For other uses, `pkuseg` provides a simple
[training API](https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md#usage):

```python
import spacy_pkuseg as pkuseg
from spacy.lang.zh import Chinese

# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

# Load pkuseg model in spaCy Chinese tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
```

</Accordion>

### Japanese language support {id="japanese",version="2.3"}

> #### Manual setup
>
> ```python
> from spacy.lang.ja import Japanese
>
> # Load SudachiPy with split mode A (default)
> nlp = Japanese()
> # Load SudachiPy with split mode B
> cfg = {"split_mode": "B"}
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
> ```

The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
config can be used to configure the split mode to `A`, `B` or `C`.

```ini {title="config.cfg"}
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
```

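To see how a `[nlp.tokenizer]` block in `config.cfg` relates to the nested
dictionary passed to `Japanese.from_config`, the mapping can be sketched with
Python's standard library. This is an illustration only: spaCy uses its own
config system rather than `configparser`, the helper name
`config_to_nested_dict` is hypothetical, and the sketch assumes that every value
is JSON-formatted, as in the snippets above.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
"""


def config_to_nested_dict(text):
    """Turn dotted section names into nested dicts (illustration only)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as "@tokenizers" unchanged
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        node = result
        # "nlp.tokenizer" becomes result["nlp"]["tokenizer"].
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, raw_value in parser[section].items():
            node[key] = json.loads(raw_value)  # assumes JSON-formatted values
    return result


config = config_to_nested_dict(CONFIG_TEXT)
print(config)
```

The inner `tokenizer` dictionary has the same shape as the `cfg` dictionary in
the manual setup example, with the registered function name stored under
`@tokenizers`.
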
Extra information, such as reading, inflection form, and the SudachiPy
normalized form, is available in `Token.morph`. For `B` or `C` split modes,
subtokens are stored in `Doc.user_data["sub_tokens"]`.

<Infobox variant="warning">

If you run into errors related to `sudachipy`, which is currently under active
development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
used for training the current [Japanese pipelines](/models/ja).

</Infobox>

### Korean language support {id="korean"}

> #### mecab-ko tokenizer
>
> ```python
> import spacy
>
> nlp = spacy.blank("ko")
> ```

The default MeCab-based Korean tokenizer requires:

- [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md)
- [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
- [natto-py](https://github.com/buruzaemon/natto-py)

For some Korean datasets and tasks, the
[rule-based tokenizer](/usage/linguistic-features#tokenization) is better suited
than MeCab. To configure a Korean pipeline with the rule-based tokenizer:

> #### Rule-based tokenizer
>
> ```python
> import spacy
>
> config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
> nlp = spacy.blank("ko", config=config)
> ```

```ini {title="config.cfg"}
[nlp]
lang = "ko"
tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
```

<Infobox>

The [Korean trained pipelines](/models/ko) use the rule-based tokenizer, so no
additional dependencies are required.

</Infobox>

## Installing and using trained pipelines {id="download"}

The easiest way to download a trained pipeline is via spaCy's
[`download`](/api/cli#download) command. It takes care of finding the
best-matching package compatible with your spaCy installation.

> #### Important note for v3.0
>
> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially
> brittle) symlinks in your spaCy installation are **deprecated**. To download
> and load an installed pipeline package, use its full name:
>
> ```diff
> - python -m spacy download en
> + python -m spacy download en_core_web_sm
> ```
>
> ```diff
> - nlp = spacy.load("en")
> + nlp = spacy.load("en_core_web_sm")
> ```

```bash
# Download best-matching version of a package for your spaCy installation
$ python -m spacy download en_core_web_sm

# Download exact package version
$ python -m spacy download en_core_web_sm-3.0.0 --direct
```

The download command will [install the package](/usage/models#download-pip) via
pip and place it in your `site-packages` directory.

```bash
$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS
$ python -m spacy download en_core_web_sm
```

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
```

If you're in a **Jupyter notebook** or similar environment, you can use the `!`
prefix to
[execute commands](https://ipython.org/ipython-doc/3/interactive/tutorial.html#system-shell-commands).
Be sure to **restart your kernel** or runtime after installation (just like you
would when installing other Python packages) so that the installed pipeline
package can be found.

```bash
!python -m spacy download en_core_web_sm
```

### Installation via pip {id="download-pip"}

To download a trained pipeline directly using
[pip](https://pypi.python.org/pypi/pip), point `pip install` to the URL or local
path of the wheel file or archive. Installing the wheel is usually more
efficient.

> #### Pipeline Package URLs {id="pipeline-urls"}
>
> Pretrained pipeline distributions are hosted on
> [GitHub Releases](https://github.com/explosion/spacy-models/releases), and you
> can find download links there, as well as on the model page. You can also get
> URLs directly from the command line by using `spacy info` with the `--url`
> flag, which may be useful for automation.
>
> ```bash
> spacy info en_core_web_sm --url
> ```
>
> This command will print the URL for the latest version of a pipeline
> compatible with the version of spaCy you're using. Note that looking up the
> compatibility information requires an internet connection.

```bash
# With external URL
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

# Using spacy info to get the external URL
$ pip install $(spacy info en_core_web_sm --url)

# With local file
$ pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
```

By default, this will install the pipeline package into your `site-packages`
directory. You can then use `spacy.load` to load it via its package name or
[import it](#usage-import) explicitly as a module. If you need to download
pipeline packages as part of an automated process, we recommend using pip with a
direct link, instead of relying on spaCy's [`download`](/api/cli#download)
command.

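For automated processes, the direct link can be assembled from a package name
and an exact version. The sketch below follows the pattern of the wheel URL in
the commands above. The helper names `wheel_url` and `pip_install_command` are
hypothetical, and the code only builds the command without running it.

```python
import sys

RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def wheel_url(name, version):
    """Build a wheel URL following the pattern of the URLs shown above."""
    tag = f"{name}-{version}"
    return f"{RELEASES_URL}/{tag}/{tag}-py3-none-any.whl"


def pip_install_command(name, version):
    """Return the argument list for installing the wheel with the current Python."""
    # Using sys.executable targets the interpreter that runs this script.
    return [sys.executable, "-m", "pip", "install", wheel_url(name, version)]


url = wheel_url("en_core_web_sm", "3.0.0")
command = pip_install_command("en_core_web_sm", "3.0.0")
print(url)
print(" ".join(command))
```

The returned list could be handed to `subprocess.run(command, check=True)` in a
build script, so that a failed installation stops the process.
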
You can also add the direct download link to your application's
`requirements.txt`. For more details, see the section on
[working with pipeline packages in production](#production).

### Manual download and installation {id="download-manual"}

In some cases, you might prefer downloading the data manually, for example to
place it into a custom directory. You can download the package via your browser
from the [latest releases](https://github.com/explosion/spacy-models/releases),
or configure your own download script using the URL of the archive file. The
archive consists of a package directory that contains another directory with the
pipeline data.

```yaml {title="Directory structure",highlight="6"}
└── en_core_web_md-3.0.0.tar.gz       # downloaded archive
    ├── setup.py                      # setup file for pip installation
    ├── meta.json                     # copy of pipeline meta
    └── en_core_web_md                # 📦 pipeline package
        ├── __init__.py               # init for pip installation
        └── en_core_web_md-3.0.0      # pipeline data
            ├── config.cfg            # pipeline config
            ├── meta.json             # pipeline meta
            └── ...                   # directories with component data
```

You can place the **pipeline package directory** anywhere on your local file
system.

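Given the layout above, a download script may need to locate the pipeline data
directory inside an unpacked package directory. The following sketch does this
by looking for a subdirectory that contains both `config.cfg` and `meta.json`.
The helper name `find_pipeline_data_dir` is hypothetical, and the example builds
a throwaway copy of the layout with empty files instead of a real archive.

```python
import tempfile
from pathlib import Path


def find_pipeline_data_dir(package_dir):
    """Return the first subdirectory holding config.cfg and meta.json, or None."""
    for child in sorted(Path(package_dir).iterdir()):
        has_config = (child / "config.cfg").is_file()
        has_meta = (child / "meta.json").is_file()
        if child.is_dir() and has_config and has_meta:
            return child
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    # Recreate the relevant part of the directory structure with empty files.
    package_dir = Path(tmp_dir) / "en_core_web_md"
    data_dir = package_dir / "en_core_web_md-3.0.0"
    data_dir.mkdir(parents=True)
    (package_dir / "__init__.py").touch()
    (data_dir / "config.cfg").touch()
    (data_dir / "meta.json").touch()

    found = find_pipeline_data_dir(package_dir)
    found_name = found.name
    print(found_name)
```

With a real package, the returned path is the data directory that can be passed
to `spacy.load`.
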
### Installation from Python {id="download-python"}

Since the [`spacy download`](/api/cli#download) command installs the pipeline as
a **Python package**, we always recommend running it from the command line, just
like you install other Python packages with `pip install`. However, if you need
to, or if you want to integrate the download process into another CLI command,
you can also import and call the `download` function used by the CLI via Python.

<Infobox variant="warning">

Keep in mind that the `download` command installs a Python package into your
environment. In order for it to be found after installation, you will need to
**restart or reload** your Python process so that new packages are recognized.

</Infobox>

```python
import spacy
spacy.cli.download("en_core_web_sm")
```

### Using trained pipelines with spaCy {id="usage"}

To load a pipeline package, use [`spacy.load`](/api/top-level#spacy.load) with
the package name or a path to the data directory:

> #### Important note for v3.0
>
> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially
> brittle) symlinks in your spaCy installation are **deprecated**. To download
> and load an installed pipeline package, use its full name:
>
> ```diff
> - python -m spacy download en
> + python -m spacy download en_core_web_sm
> ```

```python
import spacy
nlp = spacy.load("en_core_web_sm")           # load package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm")  # load package from a directory

doc = nlp("This is a sentence.")
```

<Infobox title="Tip: Preview model info" emoji="💡">

You can use the [`info`](/api/cli#info) command or
[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline package's
meta data before loading it. Each `Language` object with a loaded pipeline also
exposes the pipeline's meta data as the attribute `meta`. For example,
`nlp.meta['version']` will return the package version.

</Infobox>

### Importing pipeline packages as modules {id="usage-import"}

If you've installed a trained pipeline via [`spacy download`](/api/cli#download)
or directly via pip, you can also `import` it and then call its `load()` method
with no arguments:

```python {executable="true"}
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
```

How you choose to load your trained pipelines ultimately depends on personal
preference. However, **for larger code bases**, we usually recommend native
imports, as this will make it easier to integrate pipeline packages with your
existing build process, continuous integration workflow and testing framework.
It'll also prevent you from ever trying to load a package that is not installed,
as your code will raise an `ImportError` immediately, instead of failing
somewhere down the line when calling `spacy.load()`. For more details, see the
section on [working with pipeline packages in production](#production).

## Using trained pipelines in production {id="production"}

If your application depends on one or more trained pipeline packages, you'll
usually want to integrate them into your continuous integration workflow and
build process. While spaCy provides a range of useful helpers for downloading
and loading pipeline packages, the underlying functionality is entirely based on
native Python packaging. This allows your application to handle a spaCy pipeline
like any other package dependency.

### Downloading and requiring package dependencies {id="models-download"}

spaCy's built-in [`download`](/api/cli#download) command is mostly intended as a
convenient, interactive wrapper. It performs compatibility checks and prints
detailed error messages and warnings. However, if you're downloading pipeline
packages as part of an automated build process, this only adds an unnecessary
layer of complexity. If you know which packages your application needs, you
should specify them directly.

Because pipeline packages are valid Python packages, you can add them to your
application's `requirements.txt`. If you're running your own internal PyPI
installation, you can upload the pipeline packages there. pip's
[requirements file format](https://pip.pypa.io/en/latest/reference/requirements-file-format/)
supports both package names to download via a PyPI server and
[direct URLs](#pipeline-urls).

```text {title="requirements.txt"}
spacy>=3.0.0,<4.0.0
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
```

All pipeline packages are versioned and specify their spaCy dependency. This
ensures cross-compatibility and lets you specify exact version requirements for
each pipeline. If you've [trained](/usage/training) your own pipeline, you can
use the [`spacy package`](/api/cli#package) command to generate the required
meta data and turn it into a loadable package.

### Loading and testing pipeline packages {id="models-loading"}

Pipeline packages are regular Python packages, so you can also import them as a
package using Python's native `import` syntax, and then call the `load` method
to load the data and return an `nlp` object:

```python
import en_core_web_sm
nlp = en_core_web_sm.load()
```

In general, this approach is recommended for larger code bases, as it's more
"native", and doesn't rely on spaCy's loader to resolve string names to
packages. If a package can't be imported, Python will raise an `ImportError`
immediately. And if a package is imported but not used, any linter will catch
that.

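The same fail-fast idea can be applied at application startup without importing
every pipeline. The sketch below uses the standard library's
`importlib.util.find_spec` to report which required packages are missing. The
helper name `missing_packages` is hypothetical, and `json` merely stands in for
an installed pipeline package so that the example runs without spaCy.

```python
import importlib.util


def missing_packages(package_names):
    """Return the names that cannot be found in the current environment."""
    return [
        name
        for name in package_names
        if importlib.util.find_spec(name) is None
    ]


# "json" stands in for an installed pipeline package such as en_core_web_sm.
# The second name is made up and should not exist in any environment.
required = ["json", "hypothetical_pipeline_that_is_not_installed"]
missing = missing_packages(required)
print(missing)
```

An application could raise an error with the list of missing names before doing
any other work, which keeps the failure close to its cause.
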
Similarly, it'll give you more flexibility when writing tests that require
loading pipelines. For example, instead of writing your own `try` and `except`
logic around spaCy's loader, you can use
[pytest](http://pytest.readthedocs.io/en/latest/)'s
[`importorskip()`](https://docs.pytest.org/en/latest/builtin.html#_pytest.outcomes.importorskip)
function to only run a test if a specific pipeline package or version is
installed. Each pipeline package exposes a `__version__` attribute which you can
also use to perform your own version compatibility checks before loading it.
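
A custom compatibility check based on `__version__` could look like the
following sketch. The helpers `version_tuple` and `is_at_least` are
hypothetical, and a `SimpleNamespace` stands in for an imported pipeline package
so that the example runs without any pipeline installed. It assumes plain
`major.minor.patch` version strings and does not handle pre-release suffixes.

```python
from types import SimpleNamespace


def version_tuple(version):
    """Convert a plain "major.minor.patch" string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(package, minimum):
    """Compare a package's __version__ against a minimum version string."""
    return version_tuple(package.__version__) >= version_tuple(minimum)


# Stand-in for an imported pipeline package, e.g. `import en_core_web_sm`.
fake_pipeline = SimpleNamespace(__version__="3.4.0")
print(is_at_least(fake_pipeline, "3.0.0"))
print(is_at_least(fake_pipeline, "3.5.0"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such
as treating `3.10.0` as older than `3.9.0`.
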