mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-26 05:31:15 +03:00 
			
		
		
		
	* Init * fix tests * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Fix test_blank_languages * Rename xx to mul in docs * Format _util with black * prettier formatting --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
		
			
				
	
	
		
			571 lines
		
	
	
		
			23 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			571 lines
		
	
	
		
			23 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| ---
 | |
| title: Models & Languages
 | |
| next: usage/facts-figures
 | |
| menu:
 | |
|   - ['Quickstart', 'quickstart']
 | |
|   - ['Language Support', 'languages']
 | |
|   - ['Installation & Usage', 'download']
 | |
|   - ['Production Use', 'production']
 | |
| ---
 | |
| 
 | |
| spaCy's trained pipelines can be installed as **Python packages**. This means
 | |
| that they're a component of your application, just like any other module.
 | |
| They're versioned and can be defined as a dependency in your `requirements.txt`.
 | |
| Trained pipelines can be installed from a download URL or a local directory,
 | |
| manually or via [pip](https://pypi.python.org/pypi/pip). Their data can be
 | |
| located anywhere on your file system.
 | |
| 
 | |
| > #### Important note
 | |
| >
 | |
| > If you're upgrading to spaCy v3.x, you need to **download the new pipeline
 | |
| > packages**. If you've trained your own pipelines, you need to **retrain** them
 | |
| > after updating spaCy.
 | |
| 
 | |
| ## Quickstart {hidden="true"}
 | |
| 
 | |
| <QuickstartModels
 | |
|   title="Quickstart"
 | |
|   id="quickstart"
 | |
|   description="Install a default trained pipeline package, get the code to load it from within spaCy and an example to test it. For more options, see the section on available packages below."
 | |
| />
 | |
| 
 | |
| ### Usage note
 | |
| 
 | |
| > If lemmatization rules are available for your language, make sure to install
 | |
| > spaCy with the `lookups` option, or install
 | |
| > [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 | |
| > separately in the same environment:
 | |
| >
 | |
| > ```bash
 | |
| > $ pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
 | |
| > ```
 | |
| 
 | |
| If a trained pipeline is available for a language, you can download it using the
 | |
| [`spacy download`](/api/cli#download) command as shown above. In order to use
 | |
| languages that don't yet come with a trained pipeline, you have to import them
 | |
| directly, or use [`spacy.blank`](/api/top-level#spacy.blank):
 | |
| 
 | |
| ```python
 | |
| from spacy.lang.yo import Yoruba
 | |
| nlp = Yoruba()  # use directly
 | |
| nlp = spacy.blank("yo")  # blank instance
 | |
| ```
 | |
| 
 | |
| A blank pipeline is typically just a tokenizer. You might want to create a blank
 | |
| pipeline when you only need a tokenizer, when you want to add more components
 | |
| from scratch, or for testing purposes. Initializing the language object directly
 | |
| yields the same result as generating it using `spacy.blank()`. In both cases the
 | |
| default configuration for the chosen language is loaded, and no pretrained
 | |
| components will be available.
 | |
| 
 | |
| ## Language support {id="languages"}
 | |
| 
 | |
| spaCy currently provides support for the following languages. You can help by
 | |
| improving the existing [language data](/usage/linguistic-features#language-data)
 | |
| and extending the tokenization patterns.
 | |
| [See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
 | |
| contribute to development. Also see the
 | |
| [training documentation](/usage/training) for how to train your own pipelines on
 | |
| your data.
 | |
| 
 | |
| <Languages />
 | |
| 
 | |
| ### Multi-language support {id="multi-language",version="2"}
 | |
| 
 | |
| > ```python
 | |
| > # Standard import
 | |
| > from spacy.lang.mul import MultiLanguage
 | |
| > nlp = MultiLanguage()
 | |
| >
 | |
| > # With lazy-loading
 | |
| > nlp = spacy.blank("mul")
 | |
| > ```
 | |
| 
 | |
| spaCy also supports pipelines trained on more than one language. This is
 | |
| especially useful for named entity recognition. The language ID used for
 | |
| multi-language or language-neutral pipelines is `mul`. The language class, a
 | |
| generic subclass containing only the base language data, can be found in
 | |
| [`lang/mul`](%%GITHUB_SPACY/spacy/lang/mul).
 | |
| 
 | |
| To train a pipeline using the neutral multi-language class, you can set
 | |
| `lang = "mul"` in your [training config](/usage/training#config). You can also
 | |
| \import the `MultiLanguage` class directly, or call
 | |
| [`spacy.blank("mul")`](/api/top-level#spacy.blank) for lazy-loading.
 | |
| 
 | |
| ### Chinese language support {id="chinese",version="2.3"}
 | |
| 
 | |
| The Chinese language class supports three word segmentation options, `char`,
 | |
| `jieba` and `pkuseg`.
 | |
| 
 | |
| > #### Manual setup
 | |
| >
 | |
| > ```python
 | |
| > from spacy.lang.zh import Chinese
 | |
| >
 | |
| > # Character segmentation (default)
 | |
| > nlp = Chinese()
 | |
| > # Jieba
 | |
| > cfg = {"segmenter": "jieba"}
 | |
| > nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 | |
| > # PKUSeg with "mixed" model provided by pkuseg
 | |
| > cfg = {"segmenter": "pkuseg"}
 | |
| > nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 | |
| > nlp.tokenizer.initialize(pkuseg_model="mixed")
 | |
| > ```
 | |
| 
 | |
| ```ini {title="config.cfg"}
 | |
| [nlp.tokenizer]
 | |
| @tokenizers = "spacy.zh.ChineseTokenizer"
 | |
| segmenter = "char"
 | |
| ```
 | |
| 
 | |
| | Segmenter | Description                                                                                                                                                                                                                                                                                |
 | |
| | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | |
| | `char`    | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`.                                                                                                            |
 | |
| | `jieba`   | **Jieba:** to use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, you can set the option `segmenter` to `"jieba"`.                                                                                                                                                          |
 | |
| | `pkuseg`  | **PKUSeg**: As of spaCy v2.3.0, support for [PKUSeg](https://github.com/explosion/spacy-pkuseg) has been added to support better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting tokenizer option `segmenter` to `"pkuseg"`. |
 | |
| 
 | |
| <Infobox title="Changed in v3.0" variant="warning">
 | |
| 
 | |
| In v3.0, the default word segmenter has switched from Jieba to character
 | |
| segmentation. Because the `pkuseg` segmenter depends on a model that can be
 | |
| loaded from a file, the model is loaded on
 | |
| [initialization](/usage/training#config-lifecycle) (typically before training).
 | |
| This ensures that your packaged Chinese model doesn't depend on a local path at
 | |
| runtime.
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| <Accordion title="Details on spaCy's Chinese API">
 | |
| 
 | |
| The `initialize` method for the Chinese tokenizer class supports the following
 | |
| config settings for loading `pkuseg` models:
 | |
| 
 | |
| | Name               | Description                                                                                                                                                            |
 | |
| | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | |
| | `pkuseg_model`     | Name of a model provided by `spacy-pkuseg` or the path to a local model directory. ~~str~~                                                                             |
 | |
| | `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`, the default provided dictionary. ~~str~~ |
 | |
| 
 | |
| The initialization settings are typically provided in the
 | |
| [training config](/usage/training#config) and the data is loaded in before
 | |
| training and serialized with the model. This allows you to load the data from a
 | |
| local path and save out your pipeline and config, without requiring the same
 | |
| local path at runtime. See the usage guide on the
 | |
| [config lifecycle](/usage/training#config-lifecycle) for more background on
 | |
| this.
 | |
| 
 | |
| ```ini {title="config.cfg"}
 | |
| [initialize]
 | |
| 
 | |
| [initialize.tokenizer]
 | |
| pkuseg_model = "/path/to/model"
 | |
| pkuseg_user_dict = "default"
 | |
| ```
 | |
| 
 | |
| You can also initialize the tokenizer for a blank language class by calling its
 | |
| `initialize` method:
 | |
| 
 | |
| ```python {title="Examples"}
 | |
| # Initialize the pkuseg tokenizer
 | |
| cfg = {"segmenter": "pkuseg"}
 | |
| nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 | |
| 
 | |
| # Load spaCy's OntoNotes model
 | |
| nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")
 | |
| 
 | |
| # Load pkuseg's "news" model
 | |
| nlp.tokenizer.initialize(pkuseg_model="news")
 | |
| 
 | |
| # Load local model
 | |
| nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 | |
| 
 | |
| # Override the user directory
 | |
| nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes", pkuseg_user_dict="/path/to/user_dict")
 | |
| ```
 | |
| 
 | |
| You can also modify the user dictionary on-the-fly:
 | |
| 
 | |
| ```python
 | |
| # Append words to user dict
 | |
| nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
 | |
| 
 | |
| # Remove all words from user dict and replace with new words
 | |
| nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)
 | |
| 
 | |
| # Remove all words from user dict
 | |
| nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
 | |
| ```
 | |
| 
 | |
| </Accordion>
 | |
| 
 | |
| <Accordion title="Details on trained and custom Chinese pipelines" spaced>
 | |
| 
 | |
| The [Chinese pipelines](/models/zh) provided by spaCy include a custom `pkuseg`
 | |
| model trained only on
 | |
| [Chinese OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), since the
 | |
| models provided by `pkuseg` include data restricted to research use. For
 | |
| research use, `pkuseg` provides models for several different domains (`"mixed"`
 | |
| (equivalent to `"default"` from `pkuseg` packages), `"news"` `"web"`,
 | |
| `"medicine"`, `"tourism"`) and for other uses, `pkuseg` provides a simple
 | |
| [training API](https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md#usage):
 | |
| 
 | |
| ```python
 | |
| import spacy_pkuseg as pkuseg
 | |
| from spacy.lang.zh import Chinese
 | |
| 
 | |
| # Train pkuseg model
 | |
| pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
 | |
| 
 | |
| # Load pkuseg model in spaCy Chinese tokenizer
 | |
| cfg = {"segmenter": "pkuseg"}
 | |
| nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
 | |
| nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
 | |
| ```
 | |
| 
 | |
| </Accordion>
 | |
| 
 | |
| ### Japanese language support {id="japanese",version="2.3"}
 | |
| 
 | |
| > #### Manual setup
 | |
| >
 | |
| > ```python
 | |
| > from spacy.lang.ja import Japanese
 | |
| >
 | |
| > # Load SudachiPy with split mode A (default)
 | |
| > nlp = Japanese()
 | |
| > # Load SudachiPy with split mode B
 | |
| > cfg = {"split_mode": "B"}
 | |
| > nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
 | |
| > ```
 | |
| 
 | |
| The Japanese language class uses
 | |
| [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
 | |
| segmentation and part-of-speech tagging. The default Japanese language class and
 | |
| the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
 | |
| config can be used to configure the split mode to `A`, `B` or `C`.
 | |
| 
 | |
| ```ini {title="config.cfg"}
 | |
| [nlp.tokenizer]
 | |
| @tokenizers = "spacy.ja.JapaneseTokenizer"
 | |
| split_mode = "A"
 | |
| ```
 | |
| 
 | |
| Extra information, such as reading, inflection form, and the SudachiPy
 | |
| normalized form, is available in `Token.morph`. For `B` or `C` split modes,
 | |
| subtokens are stored in `Doc.user_data["sub_tokens"]`.
 | |
| 
 | |
| <Infobox variant="warning">
 | |
| 
 | |
| If you run into errors related to `sudachipy`, which is currently under active
 | |
| development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
 | |
| used for training the current [Japanese pipelines](/models/ja).
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| ### Korean language support {id="korean"}
 | |
| 
 | |
| There are currently three built-in options for Korean tokenization, two based on
 | |
| [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md) and one
 | |
| using the rule-based tokenizer.
 | |
| 
 | |
| > #### Default mecab-ko tokenizer
 | |
| >
 | |
| > ```python
 | |
| > # uses mecab-ko-dic
 | |
| > nlp = spacy.blank("ko")
 | |
| >
 | |
| > # with custom mecab args
 | |
| > mecab_args = "-d /path/to/dicdir -u /path/to/userdic"
 | |
| > config = {"nlp": {"tokenizer": {"mecab_args": mecab_args}}}
 | |
| > nlp = spacy.blank("ko", config=config)
 | |
| > ```
 | |
| 
 | |
| The default MeCab-based Korean tokenizer requires the python package
 | |
| [`mecab-ko`](https://pypi.org/project/mecab-ko/) and no further system
 | |
| requirements.
 | |
| 
 | |
| The `natto-py` MeCab-based tokenizer (the previous default for spaCy v3.4 and
 | |
| earlier) is available as `spacy.KoreanNattoTokenizer.v1`. It requires:
 | |
| 
 | |
| - [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md)
 | |
| - [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
 | |
| - [natto-py](https://github.com/buruzaemon/natto-py)
 | |
| 
 | |
| To use this tokenizer, edit `[nlp.tokenizer]` in your config:
 | |
| 
 | |
| > #### natto-py MeCab-ko tokenizer
 | |
| >
 | |
| > ```python
 | |
| > config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.KoreanNattoTokenizer.v1"}}}
 | |
| > nlp = spacy.blank("ko", config=config)
 | |
| > ```
 | |
| 
 | |
| ```ini
 | |
| ### config.cfg
 | |
| [nlp]
 | |
| lang = "ko"
 | |
| tokenizer = {"@tokenizers" = "spacy.KoreanNattoTokenizer.v1"}
 | |
| ```
 | |
| 
 | |
| For some Korean datasets and tasks, the
 | |
| [rule-based tokenizer](/usage/linguistic-features#tokenization) is better-suited
 | |
| than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
 | |
| 
 | |
| > #### Rule-based tokenizer
 | |
| >
 | |
| > ```python
 | |
| > config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
 | |
| > nlp = spacy.blank("ko", config=config)
 | |
| > ```
 | |
| 
 | |
| ```ini {title="config.cfg"}
 | |
| [nlp]
 | |
| lang = "ko"
 | |
| tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
 | |
| ```
 | |
| 
 | |
| <Infobox>
 | |
| 
 | |
| The [Korean trained pipelines](/models/ko) use the rule-based tokenizer, so no
 | |
| additional dependencies are required.
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| ## Installing and using trained pipelines {id="download"}
 | |
| 
 | |
| The easiest way to download a trained pipeline is via spaCy's
 | |
| [`download`](/api/cli#download) command. It takes care of finding the
 | |
| best-matching package compatible with your spaCy installation.
 | |
| 
 | |
| ```bash
 | |
| # Download best-matching version of a package for your spaCy installation
 | |
| $ python -m spacy download en_core_web_sm
 | |
| 
 | |
| # Download exact package version
 | |
| $ python -m spacy download en_core_web_sm-3.0.0 --direct
 | |
| ```
 | |
| 
 | |
| The download command will [install the package](/usage/models#download-pip) via
 | |
| pip and place the package in your `site-packages` directory.
 | |
| 
 | |
| ```bash
 | |
| $ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS
 | |
| $ python -m spacy download en_core_web_sm
 | |
| ```
 | |
| 
 | |
| ```python
 | |
| import spacy
 | |
| nlp = spacy.load("en_core_web_sm")
 | |
| doc = nlp("This is a sentence.")
 | |
| ```
 | |
| 
 | |
| If you're in a **Jupyter notebook** or similar environment, you can use the `!`
 | |
| prefix to
 | |
| [execute commands](https://ipython.org/ipython-doc/3/interactive/tutorial.html#system-shell-commands).
 | |
| Make sure to **restart your kernel** or runtime after installation (just like
 | |
| you would when installing other Python packages) to make sure that the installed
 | |
| pipeline package can be found.
 | |
| 
 | |
| ```bash
 | |
| !python -m spacy download en_core_web_sm
 | |
| ```
 | |
| 
 | |
| ### Installation via pip {id="download-pip"}
 | |
| 
 | |
| To download a trained pipeline directly using
 | |
| [pip](https://pypi.python.org/pypi/pip), point `pip install` to the URL or local
 | |
| path of the wheel file or archive. Installing the wheel is usually more
 | |
| efficient.
 | |
| 
 | |
| > #### Pipeline Package URLs {id="pipeline-urls"}
 | |
| >
 | |
| > Pretrained pipeline distributions are hosted on
 | |
| > [Github Releases](https://github.com/explosion/spacy-models/releases), and you
 | |
| > can find download links there, as well as on the model page. You can also get
 | |
| > URLs directly from the command line by using `spacy info` with the `--url`
 | |
| > flag, which may be useful for automation.
 | |
| >
 | |
| > ```bash
 | |
| > spacy info en_core_web_sm --url
 | |
| > ```
 | |
| >
 | |
| > This command will print the URL for the latest version of a pipeline
 | |
| > compatible with the version of spaCy you're using. Note that in order to look
 | |
| > up the compatibility information an internet connection is required.
 | |
| 
 | |
| ```bash
 | |
| # With external URL
 | |
| $ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
 | |
| $ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
 | |
| 
 | |
| # Using spacy info to get the external URL
 | |
| $ pip install $(spacy info en_core_web_sm --url)
 | |
| 
 | |
| # With local file
 | |
| $ pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
 | |
| $ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
 | |
| ```
 | |
| 
 | |
| By default, this will install the pipeline package into your `site-packages`
 | |
| directory. You can then use `spacy.load` to load it via its package name or
 | |
| [import it](#usage-import) explicitly as a module. If you need to download
 | |
| pipeline packages as part of an automated process, we recommend using pip with a
 | |
| direct link, instead of relying on spaCy's [`download`](/api/cli#download)
 | |
| command.
 | |
| 
 | |
| You can also add the direct download link to your application's
 | |
| `requirements.txt`. For more details, see the section on
 | |
| [working with pipeline packages in production](#production).
 | |
| 
 | |
| ### Manual download and installation {id="download-manual"}
 | |
| 
 | |
| In some cases, you might prefer downloading the data manually, for example to
 | |
| place it into a custom directory. You can download the package via your browser
 | |
| from the [latest releases](https://github.com/explosion/spacy-models/releases),
 | |
| or configure your own download script using the URL of the archive file. The
 | |
| archive consists of a package directory that contains another directory with the
 | |
| pipeline data.
 | |
| 
 | |
| ```yaml {title="Directory structure",highlight="6"}
 | |
| └── en_core_web_md-3.0.0.tar.gz       # downloaded archive
 | |
|     ├── setup.py                      # setup file for pip installation
 | |
|     ├── meta.json                     # copy of pipeline meta
 | |
|     └── en_core_web_md                # 📦 pipeline package
 | |
|         ├── __init__.py               # init for pip installation
 | |
|         └── en_core_web_md-3.0.0      # pipeline data
 | |
|             ├── config.cfg            # pipeline config
 | |
|             ├── meta.json             # pipeline meta
 | |
|             └── ...                   # directories with component data
 | |
| ```
 | |
| 
 | |
| You can place the **pipeline package directory** anywhere on your local file
 | |
| system.
 | |
| 
 | |
| ### Installation from Python {id="download-python"}
 | |
| 
 | |
| Since the [`spacy download`](/api/cli#download) command installs the pipeline as
 | |
| a **Python package**, we always recommend running it from the command line, just
 | |
| like you install other Python packages with `pip install`. However, if you need
 | |
| to, or if you want to integrate the download process into another CLI command,
 | |
| you can also import and call the `download` function used by the CLI via Python.
 | |
| 
 | |
| <Infobox variant="warning">
 | |
| 
 | |
| Keep in mind that the `download` command installs a Python package into your
 | |
| environment. In order for it to be found after installation, you will need to
 | |
| **restart or reload** your Python process so that new packages are recognized.
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| ```python
 | |
| import spacy
 | |
| spacy.cli.download("en_core_web_sm")
 | |
| ```
 | |
| 
 | |
| ### Using trained pipelines with spaCy {id="usage"}
 | |
| 
 | |
| To load a pipeline package, use [`spacy.load`](/api/top-level#spacy.load) with
 | |
| the package name or a path to the data directory:
 | |
| 
 | |
| ```python
 | |
| import spacy
 | |
| nlp = spacy.load("en_core_web_sm")           # load package "en_core_web_sm"
 | |
| nlp = spacy.load("/path/to/en_core_web_sm")  # load package from a directory
 | |
| 
 | |
| doc = nlp("This is a sentence.")
 | |
| ```
 | |
| 
 | |
| <Infobox title="Tip: Preview model info" emoji="💡">
 | |
| 
 | |
| You can use the [`info`](/api/cli#info) command or
 | |
| [`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline package's
 | |
| meta data before loading it. Each `Language` object with a loaded pipeline also
 | |
| exposes the pipeline's meta data as the attribute `meta`. For example,
 | |
| `nlp.meta['version']` will return the package version.
 | |
| 
 | |
| </Infobox>
 | |
| 
 | |
| ### Importing pipeline packages as modules {id="usage-import"}
 | |
| 
 | |
| If you've installed a trained pipeline via [`spacy download`](/api/cli#download)
 | |
| or directly via pip, you can also `import` it and then call its `load()` method
 | |
| with no arguments:
 | |
| 
 | |
| ```python {executable="true"}
 | |
| import en_core_web_sm
 | |
| 
 | |
| nlp = en_core_web_sm.load()
 | |
| doc = nlp("This is a sentence.")
 | |
| ```
 | |
| 
 | |
| How you choose to load your trained pipelines ultimately depends on personal
 | |
| preference. However, **for larger code bases**, we usually recommend native
 | |
| imports, as this will make it easier to integrate pipeline packages with your
 | |
| existing build process, continuous integration workflow and testing framework.
 | |
| It'll also prevent you from ever trying to load a package that is not installed,
 | |
| as your code will raise an `ImportError` immediately, instead of failing
 | |
| somewhere down the line when calling `spacy.load()`. For more details, see the
 | |
| section on [working with pipeline packages in production](#production).
 | |
| 
 | |
| ## Using trained pipelines in production {id="production"}
 | |
| 
 | |
| If your application depends on one or more trained pipeline packages, you'll
 | |
| usually want to integrate them into your continuous integration workflow and
 | |
| build process. While spaCy provides a range of useful helpers for downloading
 | |
| and loading pipeline packages, the underlying functionality is entirely based on
 | |
| native Python packaging. This allows your application to handle a spaCy pipeline
 | |
| like any other package dependency.
 | |
| 
 | |
| ### Downloading and requiring package dependencies {id="models-download"}
 | |
| 
 | |
| spaCy's built-in [`download`](/api/cli#download) command is mostly intended as a
 | |
| convenient, interactive wrapper. It performs compatibility checks and prints
 | |
| detailed error messages and warnings. However, if you're downloading pipeline
 | |
| packages as part of an automated build process, this only adds an unnecessary
 | |
| layer of complexity. If you know which packages your application needs, you
 | |
| should be specifying them directly.
 | |
| 
 | |
| Because pipeline packages are valid Python packages, you can add them to your
 | |
| application's `requirements.txt`. If you're running your own internal PyPi
 | |
| installation, you can upload the pipeline packages there. pip's
 | |
| [requirements file format](https://pip.pypa.io/en/latest/reference/requirements-file-format/)
 | |
| supports both package names to download via a PyPi server, as well as
 | |
| [direct URLs](#pipeline-urls).
 | |
| 
 | |
| ```text {title="requirements.txt"}
 | |
| spacy>=3.0.0,<4.0.0
 | |
| en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
 | |
| ```
 | |
| 
 | |
| All pipeline packages are versioned and specify their spaCy dependency. This
 | |
| ensures cross-compatibility and lets you specify exact version requirements for
 | |
| each pipeline. If you've [trained](/usage/training) your own pipeline, you can
 | |
| use the [`spacy package`](/api/cli#package) command to generate the required
 | |
| meta data and turn it into a loadable package.
 | |
| 
 | |
| ### Loading and testing pipeline packages {id="models-loading"}
 | |
| 
 | |
| Pipeline packages are regular Python packages, so you can also import them as a
 | |
| package using Python's native `import` syntax, and then call the `load` method
 | |
| to load the data and return an `nlp` object:
 | |
| 
 | |
| ```python
 | |
| import en_core_web_sm
 | |
| nlp = en_core_web_sm.load()
 | |
| ```
 | |
| 
 | |
| In general, this approach is recommended for larger code bases, as it's more
 | |
| "native", and doesn't rely on spaCy's loader to resolve string names to
 | |
| packages. If a package can't be imported, Python will raise an `ImportError`
 | |
| immediately. And if a package is imported but not used, any linter will catch
 | |
| that.
 | |
| 
 | |
| Similarly, it'll give you more flexibility when writing tests that require
 | |
| loading pipelines. For example, instead of writing your own `try` and `except`
 | |
| logic around spaCy's loader, you can use
 | |
| [pytest](http://pytest.readthedocs.io/en/latest/)'s
 | |
| [`importorskip()`](https://docs.pytest.org/en/latest/builtin.html#_pytest.outcomes.importorskip)
 | |
| method to only run a test if a specific pipeline package or version is
 | |
| installed. Each pipeline package exposes a `__version__` attribute which you can
 | |
| also use to perform your own version compatibility checks before loading it.
 |