//- 💫 DOCS > USAGE > PROCESSING TEXT

include ../../_includes/_mixins

+under-construction

+h(2, "multithreading") Multi-threading with #[code .pipe()]

p
    | If you have a sequence of documents to process, you should use the
    | #[+api("language#pipe") #[code Language.pipe()]] method. The method takes
    | an iterator of texts and accumulates an internal buffer, which it works
    | on in parallel. It then yields the documents in order, one-by-one. After
    | a long and bitter struggle, the global interpreter lock was freed around
    | spaCy's main parsing loop in v0.100.3. This means that #[code .pipe()]
    | will be significantly faster in most practical situations, because it
    | allows shared-memory parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass

p
    | To make full use of the #[code .pipe()] function, you might want to
    | brush up on #[strong Python generators]. Here are a few quick hints:

+list
    +item
        | Generator comprehensions can be written as
        | #[code (item for item in sequence)].

    +item
        | The
        | #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] built-in library]
        | and the
        | #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
        | provide a lot of handy #[strong generator tools].

    +item
        | Often you'll have an input stream that pairs text with some
        | important meta data, e.g. a JSON document. To
        | #[strong pair up the meta data] with the processed #[code Doc]
        | object, you should use the #[code itertools.tee] function to split
        | the generator in two, and then #[code zip] (#[code itertools.izip]
        | in Python 2) the extra stream to the document stream.

+h(2, "own-annotations") Bringing your own annotations

p
    | By default, spaCy assumes that your data is raw text. However,
    | sometimes your data is partially annotated, e.g. with pre-existing
    | tokenization, part-of-speech tags, etc. The most common situation is
    | that you have pre-defined tokenization. If you have a list of strings,
    | you can create a #[code Doc] object directly. Optionally, you can also
    | specify a list of boolean values, indicating whether each word is
    | followed by a space.

+code.
    from spacy.tokens import Doc

    doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
              spaces=[False, True, False, False])

p
    | If provided, the spaces list must be the same length as the words list.
    | The spaces list affects the #[code doc.text], #[code span.text],
    | #[code token.idx], #[code span.start_char] and #[code span.end_char]
    | attributes. If you don't provide a #[code spaces] sequence, spaCy will
    | assume that every word is followed by a space.

+code.
    good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
                      spaces=[False, True, False, False])
    bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])

    assert bad_spaces.text == u'Hello , world ! '
    assert good_spaces.text == u'Hello, world!'

p
    | Once you have a #[+api("doc") #[code Doc]] object, you can write to its
    | attributes to set the part-of-speech tags, syntactic dependencies, named
    | entities and other attributes. For details, see the respective usage
    | pages.
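p
    | As a minimal sketch, the snippet below writes a part-of-speech tag to a
    | token and marks a span as a named entity on a #[code Doc] created from
    | pre-tokenized words, assuming #[code nlp] is an already loaded pipeline
    | as in the examples above. The example sentence, the #[code NNP] tag and
    | the #[code ORG] label are only illustrative values.

+code.
    from spacy.tokens import Doc, Span

    # illustrative example: sentence, tag and entity label are placeholders
    doc = Doc(nlp.vocab, words=[u'Apple', u'is', u'a', u'company'])
    # write a fine-grained part-of-speech tag to the first token
    doc[0].tag_ = u'NNP'
    # mark "Apple" as an entity by assigning a list of spans to doc.ents
    doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings.add(u'ORG'))]
    assert [(ent.text, ent.label_) for ent in doc.ents] == [(u'Apple', u'ORG')]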
+h(3, "models-download") Downloading and requiring model dependencies p | spaCy's built-in #[+api("cli#download") #[code download]] command | is mostly intended as a convenient, interactive wrapper. It performs | compatibility checks and prints detailed error messages and warnings. | However, if you're downloading models as part of an automated build | process, this only adds an unnecessary layer of complexity. If you know | which models your application needs, you should be specifying them directly. p | Because all models are valid Python packages, you can add them to your | application's #[code requirements.txt]. If you're running your own | internal PyPi installation, you can simply upload the models there. pip's | #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format] | supports both package names to download via a PyPi server, as well as direct | URLs. +code("requirements.txt", "text"). spacy>=2.0.0,<3.0.0 -e #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz p | All models are versioned and specify their spaCy dependency. This ensures | cross-compatibility and lets you specify exact version requirements for | each model. If you've trained your own model, you can use the | #[+api("cli#package") #[code package]] command to generate the required | meta data and turn it into a loadable package. +h(3, "models-loading") Loading and testing models p | Downloading models directly via pip won't call spaCy's link | #[+api("cli#link") #[code link]] command, which creates | symlinks for model shortcuts. This means that you'll have to run this | command separately, or use the native #[code import] syntax to load the | models: +code. import en_core_web_sm nlp = en_core_web_sm.load() p | In general, this approach is recommended for larger code bases, as it's | more "native", and doesn't depend on symlinks or rely on spaCy's loader | to resolve string names to model packages. If a model can't be | imported, Python will raise an #[code ImportError] immediately. And if a | model is imported but not used, any linter will catch that. p | Similarly, it'll give you more flexibility when writing tests that | require loading models. For example, instead of writing your own | #[code try] and #[code except] logic around spaCy's loader, you can use | #[+a("http://pytest.readthedocs.io/en/latest/") pytest]'s | #[code importorskip()] method to only run a test if a specific model or | model version is installed. Each model package exposes a #[code __version__] | attribute which you can also use to perform your own version compatibility | checks before loading a model.