mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
54 lines
2.3 KiB
Plaintext
//- 💫 DOCS > USAGE > PROCESSING PIPELINES > MULTI-THREADING
|
|
|
|
p
|
|
| If you have a sequence of documents to process, you should use the
|
|
| #[+api("language#pipe") #[code Language.pipe()]] method. The method takes
|
|
| an iterator of texts, and accumulates an internal buffer,
|
|
| which it works on in parallel. It then yields the documents in order,
|
|
| one-by-one. After a long and bitter struggle, the global interpreter
| lock (GIL) was released around spaCy's main parsing loop in v0.100.3. This
| means that #[code .pipe()] will be significantly faster than calling
| #[code nlp] on each text in turn in most practical situations, because
| threads can parse in parallel over shared memory.
|
|
|
|
+code.
|
|
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass
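p
| The buffer-then-yield-in-order behaviour described above can be sketched
| in plain Python. This is only an illustration under stated assumptions,
| not spaCy's implementation: #[code buffered_pipe] and
| #[code toy_process] are hypothetical names, and ordinary Python threads
| will not give the speed-up that spaCy gets from releasing the GIL in its
| parsing loop.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def buffered_pipe(items, func, batch_size=4, n_threads=2):
    """Apply func to items one buffer at a time, yielding results in input order.

    Hypothetical helper for illustration only; it is not part of spaCy.
    """
    items = iter(items)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while True:
            # Accumulate the next buffer of inputs from the stream.
            batch = list(islice(items, batch_size))
            if not batch:
                break
            # Executor.map returns results in input order, even if the
            # worker threads finish out of order.
            yield from pool.map(func, batch)


def toy_process(text):
    # Hypothetical stand-in for the real per-document work.
    return text.upper()


# The input is a generator, so texts are only read as each buffer is filled.
texts = (f"document {i}" for i in range(10))
results = list(buffered_pipe(texts, toy_process))
```

p
| Because the input is consumed one buffer at a time, memory use is bounded
| by #[code batch_size] rather than by the length of the stream.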
|
|
|
|
p
|
|
| To make full use of the #[code .pipe()] function, you might want to
|
|
| brush up on #[strong Python generators]. Here are a few quick hints:
|
|
|
|
+list
|
|
+item
|
|
| Generator comprehensions can be written as
|
|
| #[code (item for item in sequence)].
|
|
|
|
+item
|
|
| The
|
|
| #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] built-in library]
|
|
| and the
|
|
| #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
|
|
| provide a lot of handy #[strong generator tools].
|
|
|
|
+item
|
|
| Often you'll have an input stream that pairs text with some
| important metadata, e.g. a JSON document. To
| #[strong pair up the metadata] with the processed #[code Doc]
| object, you should use the #[code itertools.tee] function to split
| the generator in two, and then #[code izip] (the built-in #[code zip]
| on Python 3) the extra stream to the document stream. Here's
| #[+a(gh("spacy") + "/issues/172#issuecomment-183963403") an example].
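p
| The #[code tee]-and-zip pattern from the last hint might look roughly
| like this on Python 3. It is a minimal sketch, not the code from the
| linked issue: #[code pipe_with_metadata], #[code stand_in_pipe] and the
| #[code "text"] key are assumptions made for illustration, and the
| stand-in simply splits on whitespace so the snippet runs without a
| model.

```python
from itertools import tee


def pipe_with_metadata(records, pipe, text_key="text"):
    """Yield (processed, record) pairs, keeping each record next to its result.

    Hypothetical helper: `pipe` is any callable that maps an iterable of
    texts to an iterable of results in the same order.
    """
    # Split the single input stream into two independent iterators.
    records_for_text, records_for_meta = tee(records)
    # One copy feeds the texts to the pipeline...
    texts = (record[text_key] for record in records_for_text)
    # ...and the other is zipped back onto the results, which arrive in order.
    return zip(pipe(texts), records_for_meta)


def stand_in_pipe(texts):
    # Assumption: stands in for a real pipeline; it just splits on whitespace.
    for text in texts:
        yield text.split()


records = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "Generators are lazy"},
]
pairs = list(pipe_with_metadata(iter(records), stand_in_pipe))
```

p
| With a loaded pipeline, #[code nlp.pipe] would be passed in place of
| #[code stand_in_pipe]. Note that #[code tee] keeps every record the
| pipeline has read ahead in memory until the paired result is consumed.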
|
|
|
|
+h(3, "multi-processing-example") Example: Multi-processing with Joblib
|
|
|
|
p
|
|
| This example shows how to use multiple cores to process text using
|
|
| spaCy and #[+a("https://joblib.readthedocs.io/en/latest/parallel.html") Joblib]. We're
|
|
| exporting part-of-speech-tagged, true-cased, (very roughly)
|
|
| sentence-separated text, with each "sentence" on a newline, and
|
|
| spaces between tokens. The data comes from the IMDB movie reviews
| dataset, which is loaded automatically via Thinc's built-in dataset
| loader.
|
|
|
|
+github("spacy", "examples/pipeline/multi_processing.py", 500)
|