
//- 💫 DOCS > USAGE > PROCESSING TEXT

include ../../_includes/_mixins

+h(2, "multithreading") Multi-threading with #[code .pipe()]

p
    |  If you have a sequence of documents to process, you should use the
    |  #[+api("language#pipe") #[code .pipe()]] method. The method takes an
    |  iterator of texts and accumulates them in an internal buffer,
    |  which it works on in parallel. It then yields the documents in order,
    |  one by one. After a long and bitter struggle, the global interpreter
    |  lock (GIL) has been released around spaCy's main parsing loop since
    |  v0.100.3. This means that the #[code .pipe()] method will be
    |  significantly faster in most practical situations, because it allows
    |  shared-memory parallelism.

+code.
    for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
        pass

p
    |  To make full use of the #[code .pipe()] method, you might want to
    |  brush up on Python generators. Here are a few quick hints:

+list
    +item
        |  Generator expressions can be written as
        |  #[code (item for item in sequence)]

    +item
        |  The #[code itertools] standard library module and the
        |  #[code cytoolz] package provide a lot of handy generator tools

    +item
        |  Often you'll have an input stream that pairs text with some
        |  important metadata, e.g. a JSON document. To pair up the metadata
        |  with the processed #[code Doc] object, you should use the
        |  #[code itertools.tee] function to split the generator in two, and
        |  then #[code izip] (the built-in #[code zip] on Python 3) the
        |  extra stream to the document stream.
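
p
    |  The last hint can be sketched roughly as follows. This is a minimal
    |  illustration, not spaCy's own code: #[code fake_pipe] is a
    |  hypothetical stand-in for #[code nlp.pipe()] so the example runs
    |  without a loaded model, and the record layout and the
    |  #[code pair_with_metadata] helper are assumptions made for the
    |  example.

```python
import itertools


def fake_pipe(texts):
    # Hypothetical stand-in for nlp.pipe(): consumes an iterator of texts
    # and yields one result per text, in the same order.
    for text in texts:
        yield text.split()


def pair_with_metadata(records, pipe):
    # Split the record stream in two so the texts and the metadata can be
    # consumed independently.
    text_stream, meta_stream = itertools.tee(records)
    texts = (record["text"] for record in text_stream)
    ids = (record["id"] for record in meta_stream)
    # zip is lazy on Python 3; on Python 2, itertools.izip does the same job.
    return zip(ids, pipe(texts))


# Assumed input: an iterator of JSON-like records.
records = iter([
    {"id": 1, "text": "Hello, world!"},
    {"id": 2, "text": "spaCy processes streams of text."},
])

pairs = list(pair_with_metadata(records, fake_pipe))
for record_id, doc in pairs:
    print(record_id, doc)
```

p
    |  Note that #[code tee] keeps items in memory until both streams have
    |  consumed them, so the buffer grows whenever one stream runs ahead of
    |  the other.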

+h(2, "own-annotations") Bringing your own annotations

p
    |  spaCy assumes by default that your data is raw text. However,
    |  sometimes your data is partially annotated, e.g. with pre-existing
    |  tokenization, part-of-speech tags, etc. The most common situation is
    |  that you have pre-defined tokenization. If you have a list of strings,
    |  you can create a #[code Doc] object directly. Optionally, you can also
    |  specify a list of boolean values, indicating whether each word has a
    |  subsequent space.

+code.
    from spacy.tokens import Doc
    doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])

p
    |  If provided, the spaces list must be the same length as the words list.
    |  The spaces list affects the #[code doc.text], #[code span.text],
    |  #[code token.idx], #[code span.start_char] and #[code span.end_char]
    |  attributes. If you don't provide a #[code spaces] sequence, spaCy will
    |  assume that all words are whitespace delimited.

+code.
    good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
    bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
    assert bad_spaces.text == u'Hello , world !'
    assert good_spaces.text == u'Hello, world!'
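
p
    |  To see why the spaces list affects the character offsets, here is a
    |  plain-Python sketch of how a text and per-word start offsets could be
    |  derived from #[code words] and #[code spaces]. It only illustrates
    |  the idea and is not how spaCy implements it; the function name
    |  #[code text_and_offsets] is hypothetical.

```python
def text_and_offsets(words, spaces):
    # Hypothetical helper: rebuild the text and record where each word
    # starts, given a flag saying whether a space follows each word.
    if len(words) != len(spaces):
        raise ValueError("spaces must be the same length as words")
    pieces = []
    offsets = []
    position = 0
    for word, has_space in zip(words, spaces):
        offsets.append(position)
        piece = word + (" " if has_space else "")
        pieces.append(piece)
        position += len(piece)
    return "".join(pieces), offsets


text, offsets = text_and_offsets(
    ["Hello", ",", "world", "!"],
    [False, True, False, False],
)
print(text)
print(offsets)
```

p
    |  Changing a single flag in the spaces list shifts the offset of every
    |  word that follows it.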

p
    |  Once you have a #[+api("doc") #[code Doc]] object, you can write to its
    |  attributes to set the part-of-speech tags, syntactic dependencies, named
    |  entities and other attributes. For details, see the respective usage
    |  pages.