//- 💫 DOCS > USAGE > PROCESSING TEXT

include ../../_includes/_mixins

+under-construction

+h(2, "multithreading") Multi-threading with #[code .pipe()]

p
    |  If you have a sequence of documents to process, you should use the
    |  #[+api("language#pipe") #[code Language.pipe()]] method. The method
    |  takes an iterator of texts, accumulates them in an internal buffer and
    |  works on the buffered batch in parallel. It then yields the documents
    |  in order, one by one. After a long and bitter struggle, the global
    |  interpreter lock was freed around spaCy's main parsing loop in
    |  v0.100.3. This means that #[code .pipe()] will be significantly faster
    |  in most practical situations, because it allows shared-memory
    |  parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass

p
    |  To make full use of the #[code .pipe()] function, you might want to
    |  brush up on #[strong Python generators]. Here are a few quick hints:

+list
    +item
        |  Generator comprehensions can be written as
        |  #[code (item for item in sequence)].

    +item
        |  The
        |  #[+a("https://docs.python.org/2/library/itertools.html") #[code itertools] built-in library]
        |  and the
        |  #[+a("https://github.com/pytoolz/cytoolz") #[code cytoolz] package]
        |  provide a lot of handy #[strong generator tools].

    +item
        |  Often you'll have an input stream that pairs text with some
        |  important metadata, e.g. a JSON document. To
        |  #[strong pair up the metadata] with the processed #[code Doc]
        |  object, you should use the #[code itertools.tee] function to split
        |  the generator in two, and then #[code izip] (#[code zip] in
        |  Python 3) the extra stream to the document stream, as shown in the
        |  sketch below.

+h(2, "own-annotations") Bringing your own annotations

p
    |  By default, spaCy assumes that your data is raw text. However,
    |  sometimes your data is partially annotated, e.g. with pre-existing
    |  tokenization, part-of-speech tags, etc. The most common situation is
    |  that you have pre-defined tokenization. If you have a list of strings,
    |  you can create a #[code Doc] object directly. Optionally, you can also
    |  specify a list of boolean values, indicating whether each word is
    |  followed by a space.

+code.
    from spacy.tokens import Doc

    doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
              spaces=[False, True, False, False])

p
    |  If provided, the spaces list must be the same length as the words list.
    |  The spaces list affects the #[code doc.text], #[code span.text],
    |  #[code token.idx], #[code span.start_char] and #[code span.end_char]
    |  attributes. If you don't provide a #[code spaces] sequence, spaCy will
    |  assume that all words are whitespace delimited.

+code.
    good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
                      spaces=[False, True, False, False])
    bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
    assert bad_spaces.text == u'Hello , world !'
    assert good_spaces.text == u'Hello, world!'

p
    |  Once you have a #[+api("doc") #[code Doc]] object, you can write to
    |  its attributes to set the part-of-speech tags, syntactic dependencies,
    |  named entities and other annotations. For details, see the respective
    |  usage pages.

+h(2, "models") Working with models

p
    |  If your application depends on one or more #[+a("/docs/usage/models") models],
    |  you'll usually want to integrate them into your continuous integration
    |  workflow and build process. While spaCy provides a range of useful helpers
    |  for downloading, linking and loading models, the underlying functionality
    |  is entirely based on native Python packages. This allows your application
    |  to handle a model like any other package dependency.

| +h(3, "models-download") Downloading and requiring model dependencies
 | |
| 
 | |
| p
 | |
|     |  spaCy's built-in #[+api("cli#download") #[code download]] command
 | |
|     |  is mostly intended as a convenient, interactive wrapper. It performs
 | |
|     |  compatibility checks and prints detailed error messages and warnings.
 | |
|     |  However, if you're downloading models as part of an automated build
 | |
|     |  process, this only adds an unecessary layer of complexity. If you know
 | |
|     |  which models your application needs, you should be specifying them directly.
 | |
| 
 | |
p
    |  Because all models are valid Python packages, you can add them to your
    |  application's #[code requirements.txt]. If you're running your own
    |  internal PyPI installation, you can simply upload the models there. pip's
    |  #[+a("https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format") requirements file format]
    |  supports both package names to download via a PyPI server, and direct
    |  URLs.

+code("requirements.txt", "text").
    spacy>=2.0.0,<3.0.0
    -e #{gh("spacy-models")}/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz

p
    |  All models are versioned and specify their spaCy dependency. This ensures
    |  cross-compatibility and lets you specify exact version requirements for
    |  each model. If you've trained your own model, you can use the
    |  #[+api("cli#package") #[code package]] command to generate the required
    |  metadata and turn it into a loadable package.

+h(3, "models-loading") Loading and testing models

p
    |  Downloading models directly via pip won't call spaCy's
    |  #[+api("cli#link") #[code link]] command, which creates
    |  symlinks for model shortcuts. This means that you'll have to run this
    |  command separately, or use the native #[code import] syntax to load the
    |  models:

+code.
    import en_core_web_sm
    nlp = en_core_web_sm.load()

p
    |  In general, this approach is recommended for larger code bases, as it's
    |  more "native", and doesn't depend on symlinks or rely on spaCy's loader
    |  to resolve string names to model packages. If a model can't be
    |  imported, Python will raise an #[code ImportError] immediately. And if a
    |  model is imported but not used, any linter will catch that.

p
    |  Similarly, it'll give you more flexibility when writing tests that
    |  require loading models. For example, instead of writing your own
    |  #[code try] and #[code except] logic around spaCy's loader, you can use
    |  #[+a("http://pytest.readthedocs.io/en/latest/") pytest]'s
    |  #[code importorskip()] function to only run a test if a specific model or
    |  model version is installed. Each model package exposes a #[code __version__]
    |  attribute which you can also use to perform your own version compatibility
    |  checks before loading a model.
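
p
    |  For example, a minimal sketch of such a test, assuming
    |  #[code en_core_web_sm] is the model under test:

+code.
    import pytest

    def test_en_core_web_sm():
        # skip this test entirely if the model (or this version) isn't installed
        en = pytest.importorskip('en_core_web_sm', minversion='2.0.0')
        nlp = en.load()
        doc = nlp(u'This is a sentence.')
        assert len(doc) == 5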