spaCy/website/docs/usage/production-use.jade

//- 💫 DOCS > USAGE > PROCESSING TEXT

include ../../_includes/_mixins

+under-construction

+h(2, "multithreading") Multi-threading with #[code .pipe()]

p
    |  If you have a sequence of documents to process, you should use the
    |  #[+api("language#pipe") #[code Language.pipe()]] method. The method takes
    |  an iterator of texts, and accumulates an internal buffer,
    |  which it works on in parallel. It then yields the documents in order,
    |  one by one. After a long and bitter struggle, the global interpreter
    |  lock was released around spaCy's main parsing loop in v0.100.3. This means
    |  that #[code .pipe()] will be significantly faster in most
    |  practical situations, because it allows shared-memory parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass
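
p
    |  As a rough illustration of the buffering described above, here is a
    |  minimal pure-Python sketch. It is not spaCy's implementation: the
    |  #[code minibatch] and #[code toy_pipe] helpers are hypothetical, and
    |  upper-casing stands in for the real per-batch work.

```python
from itertools import islice


def minibatch(items, size):
    """Group any iterable into lists of at most `size` items."""
    items = iter(items)
    while True:
        batch = list(islice(items, size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, batch_size):
    """Hypothetical stand-in for nlp.pipe(): buffer, process, yield in order."""
    for batch in minibatch(texts, batch_size):
        # A real pipeline would work on the whole batch in parallel here.
        processed = [text.upper() for text in batch]
        for result in processed:
            yield result


# The input can be any iterator; results come back in the original order.
results = list(toy_pipe(iter(["one", "two", "three"]), batch_size=2))
```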

p
    |  To make full use of the #[code .pipe()] function, you might want to
    |  brush up on #[strong Python generators]. Here are a few quick hints:

+list
    +item
        |  Generator comprehensions can be written as
        |  #[code (item for item in sequence)].

    +item
        |  The
        |  #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] built-in library]
        |  and the
        |  #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
        |  provide a lot of handy #[strong generator tools].

    +item
        |  Often you'll have an input stream that pairs text with some
        |  important metadata, e.g. a JSON document. To
        |  #[strong pair up the metadata] with the processed #[code Doc]
        |  object, you should use the #[code itertools.tee] function to split
        |  the generator in two, and then #[code izip] (the built-in
        |  #[code zip] on Python 3) the extra stream to the document stream.
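
p
    |  The last hint can be sketched as follows. To keep the example
    |  self-contained, #[code fake_pipe] is a hypothetical stand-in for
    |  #[code nlp.pipe()] that simply splits each text on whitespace, and
    |  #[code read_records] is a made-up input stream. Only the
    |  #[code tee] and #[code zip] pattern is the point here.

```python
from itertools import tee


def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe(): one result per text, in order."""
    for text in texts:
        yield text.split()


def read_records():
    """Made-up input stream that pairs each text with some metadata."""
    yield {"id": 1, "text": "Hello world"}
    yield {"id": 2, "text": "spaCy is fast"}


# Split the record stream in two: one copy feeds the texts,
# the other keeps the metadata.
records_for_text, records_for_meta = tee(read_records())
texts = (record["text"] for record in records_for_text)
meta = (record["id"] for record in records_for_meta)

# zip() is lazy on Python 3; on Python 2, use itertools.izip instead.
paired = list(zip(fake_pipe(texts), meta))
```

p
    |  Note that #[code tee] has to hold on to items that one copy has
    |  consumed and the other has not, so the metadata for a batch stays in
    |  memory until the matching documents are yielded.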

+h(2, "own-annotations") Bringing your own annotations

p
    |  By default, spaCy assumes that your data is raw text. However,
    |  sometimes your data is partially annotated, e.g. with pre-existing
    |  tokenization, part-of-speech tags, etc. The most common situation is
    |  that you have pre-defined tokenization. If you have a list of strings,
    |  you can create a #[code Doc] object directly. Optionally, you can also
    |  specify a list of boolean values, indicating whether each word has a
    |  subsequent space.

+code.
    from spacy.tokens import Doc
    doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])

p
    |  If provided, the spaces list must be the same length as the words list.
    |  The spaces list affects the #[code doc.text], #[code span.text],
    |  #[code token.idx], #[code span.start_char] and #[code span.end_char]
    |  attributes. If you don't provide a #[code spaces] sequence, spaCy will
    |  assume that all words are whitespace delimited.

+code.
    good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
    bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
    assert bad_spaces.text == u'Hello , world !'
    assert good_spaces.text == u'Hello, world!'
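
p
    |  To see why the spaces list also changes the character offsets, here
    |  is a pure-Python sketch of the bookkeeping. The #[code layout] helper
    |  is hypothetical and only mimics the documented behaviour. It is not
    |  how spaCy stores tokens internally.

```python
def layout(words, spaces):
    """Return the joined text and the character offset of each word."""
    if len(spaces) != len(words):
        raise ValueError("spaces must be the same length as words")
    text = ""
    offsets = []
    for word, has_space in zip(words, spaces):
        # The offset of a word is the length of everything before it.
        offsets.append(len(text))
        text += word + (" " if has_space else "")
    return text, offsets


text, offsets = layout(["Hello", ",", "world", "!"],
                       [False, True, False, False])
```

p
    |  Here #[code text] is #[code "Hello, world!"] and the offsets are
    |  #[code [0, 5, 7, 12]], which is the role #[code token.idx] plays for
    |  each token.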

p
    |  Once you have a #[+api("doc") #[code Doc]] object, you can write to its
    |  attributes to set the part-of-speech tags, syntactic dependencies, named
    |  entities and other attributes. For details, see the respective usage
    |  pages.

+h(2, "models") Working with models

p
    |  If your application depends on one or more #[+a("/docs/usage/models") models],
    |  you'll usually want to integrate them into your continuous integration
    |  workflow and build process. While spaCy provides a range of useful helpers
    |  for downloading, linking and loading models, the underlying functionality
    |  is entirely based on native Python packages. This allows your application
    |  to handle a model like any other package dependency.

+h(3, "models-download") Downloading and requiring model dependencies

p
    |  spaCy's built-in #[+api("cli#download") #[code download]] command
    |  is mostly intended as a convenient, interactive wrapper. It performs
    |  compatibility checks and prints detailed error messages and warnings.
    |  However, if you're downloading models as part of an automated build
    |  process, this only adds an unnecessary layer of complexity. If you know
    |  which models your application needs, you should specify them directly.

p
    |  Because all models are valid Python packages, you can add them to your
    |  application's #[code requirements.txt]. If you're running your own
    |  internal PyPI installation, you can simply upload the models there. pip's
    |  #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
    |  supports both package names to download via a PyPI server and direct
    |  URLs.

+code("requirements.txt", "text").
    spacy&gt;=2.0.0,&lt;3.0.0
    #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz

p
    |  All models are versioned and specify their spaCy dependency. This ensures
    |  cross-compatibility and lets you specify exact version requirements for
    |  each model. If you've trained your own model, you can use the
    |  #[+api("cli#package") #[code package]] command to generate the required
    |  metadata and turn your model into a loadable package.

+h(3, "models-loading") Loading and testing models

p
    |  Downloading models directly via pip won't call spaCy's
    |  #[+api("cli#link") #[code link]] command, which creates
    |  symlinks for model shortcuts. This means that you'll have to run this
    |  command separately, or use the native #[code import] syntax to load the
    |  models:

+code.
    import en_core_web_sm
    nlp = en_core_web_sm.load()

p
    |  In general, this approach is recommended for larger code bases, as it's
    |  more "native", and doesn't depend on symlinks or rely on spaCy's loader
    |  to resolve string names to model packages. If a model can't be
    |  imported, Python will raise an #[code ImportError] immediately. And if a
    |  model is imported but not used, most linters will flag the unused import.

p
    |  Similarly, it'll give you more flexibility when writing tests that
    |  require loading models. For example, instead of writing your own
    |  #[code try] and #[code except] logic around spaCy's loader, you can use
    |  #[+a("http://pytest.readthedocs.io/en/latest/") pytest]'s
    |  #[code importorskip()] function to run a test only if a specific model or
    |  model version is installed. Each model package exposes a #[code __version__]
    |  attribute which you can also use to perform your own version compatibility
    |  checks before loading a model.
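
p
    |  A version check of this kind could look like the sketch below. The
    |  #[code parse_version] and #[code is_compatible] helpers and the
    |  #[code 2.0.0] to #[code 3.0.0] range are assumptions made for
    |  illustration, not part of spaCy. In real code you would pass in the
    |  #[code __version__] attribute of the imported model package.

```python
import re


def parse_version(version):
    """Turn a version string such as '2.0.0' into a tuple of three ints.

    Only the leading digits of each part are used, so a pre-release
    suffix like 'a4' in '2.0.0a4' is ignored.
    """
    parts = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    # Pad short versions such as '2.0' so tuple comparison behaves.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_compatible(model_version, minimum=(2, 0, 0), below=(3, 0, 0)):
    """Check that minimum <= model_version < below (hypothetical range)."""
    return minimum <= parse_version(model_version) < below


# With a real model: is_compatible(en_core_web_sm.__version__)
ok = is_compatible("2.0.0")
too_new = is_compatible("3.1.0")
```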