//- 💫 DOCS > USAGE > PROCESSING PIPELINES > MULTI-THREADING

p
    |  If you have a sequence of documents to process, you should use the
    |  #[+api("language#pipe") #[code Language.pipe()]] method. The method
    |  takes an iterator of texts and accumulates an internal buffer, which
    |  it works on in parallel. It then yields the documents in order, one
    |  by one. After a long and bitter struggle, the global interpreter lock
    |  was freed around spaCy's main parsing loop in v0.100.3. This means
    |  that #[code .pipe()] will be significantly faster in most practical
    |  situations, because it allows shared memory parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass

p
    |  To make full use of the #[code .pipe()] function, you might want to
    |  brush up on #[strong Python generators]. Here are a few quick hints:

+list
    +item
        |  Generator comprehensions can be written as
        |  #[code (item for item in sequence)].

    +item
        |  The
        |  #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] built-in library]
        |  and the
        |  #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
        |  provide a lot of handy #[strong generator tools].

    +item
        |  Often you'll have an input stream that pairs text with some
        |  important metadata, e.g. a JSON document. To
        |  #[strong pair up the metadata] with the processed #[code Doc]
        |  object, you should use the #[code itertools.tee] function to split
        |  the generator in two, and then #[code izip] the extra stream to
        |  the document stream, as sketched below. Here's
        |  #[+a(gh("spacy") + "/issues/172#issuecomment-183963403") an example].
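
p
    |  Here's a minimal sketch of that #[code tee]/#[code izip] pattern. The
    |  #[code stream] of #[code (id, text)] pairs is a made-up stand-in for
    |  your own input, e.g. records read from a JSON file; everything else
    |  uses only #[code itertools] and the #[code nlp.pipe()] call shown
    |  above.

+code.
    from itertools import tee, izip

    # hypothetical input stream pairing an ID with each text
    stream = ((i, text) for i, text in enumerate(texts))
    # split the stream in two, so the IDs and the texts can be
    # consumed independently without exhausting the generator
    stream1, stream2 = tee(stream)
    ids = (id_ for id_, text in stream1)
    texts_only = (text for id_, text in stream2)
    # lazily zip the metadata stream back up with the processed docs
    for id_, doc in izip(ids, nlp.pipe(texts_only, batch_size=10000, n_threads=3)):
        pass  # doc is paired with its original ID here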

+h(3, "multi-processing-example") Example: Multi-processing with Joblib

p
    |  This example shows how to use multiple cores to process text using
    |  spaCy and #[+a("https://pythonhosted.org/joblib/") Joblib]. We're
    |  exporting part-of-speech-tagged, true-cased, (very roughly)
    |  sentence-separated text, with each "sentence" on a newline, and
    |  spaces between tokens. The data comes from the IMDB movie reviews
    |  dataset, which is loaded automatically via Thinc's built-in dataset
    |  loader.
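
p
    |  In outline, the pattern looks roughly like this. The
    |  #[code partition_all] helper comes from #[code cytoolz] (mentioned
    |  above); the batch size and #[code n_jobs] values are illustrative,
    |  not tuned. See the full example below for the complete script.

+code.
    from joblib import Parallel, delayed
    from cytoolz import partition_all
    import spacy

    def process_batch(batch_id, texts):
        # each worker process loads its own pipeline, because spaCy
        # objects can't be shared across process boundaries
        nlp = spacy.load('en')
        for doc in nlp.pipe(texts):
            pass  # e.g. write tagged, sentence-split output to a file

    # split the input into fixed-size batches and fan them out
    partitions = partition_all(10000, texts)
    Parallel(n_jobs=4)(delayed(process_batch)(i, list(batch))
                       for i, batch in enumerate(partitions))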

+github("spacy", "examples/pipeline/multi_processing.py", 500)