include ../_includes/_mixins

+lead In v0.100.3, we quietly rolled out support for GIL-free multi-threading for spaCy's syntactic dependency parsing and named entity recognition models. Because these models take up a lot of memory, we've wanted to release the global interpreter lock (GIL) around them for a long time. When we finally did, it seemed a little too good to be true, so we delayed celebration — and then quickly moved on to other things. It's now past time for a write-up.

p This is mostly an implementation post, but to me, implementation is the pain and the product is the pleasure. So, let's start with the pay-off. The pay-off is the #[code .pipe()] method, which adds #[em data-streaming] capabilities to spaCy:

+code('python', 'Stream Parsing').
    import spacy

    nlp = spacy.load('de')

    # texts can be any iterable of unicode strings
    for doc in nlp.pipe(texts, n_threads=16, batch_size=10000):
        analyse_text(doc)
p The #[code .pipe()] method accepts an iterator (above, #[code texts]) and produces an iterator. Internally, a buffer is accumulated, with its size given by the #[code batch_size] argument, and multiple threads are allowed to work on the batch simultaneously. Once the batch is complete, the processed documents are yielded from the iterator.
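
p Stripped of the threading, the buffering scheme is simple. Here's a pure-Python sketch, where #[code process_batch] is a hypothetical stand-in for the real batch-level worker:

+code('python', 'Minibatch streaming sketch').
    def pipe_sketch(texts, batch_size=1000):
        queue = []
        for text in texts:
            queue.append(text)
            if len(queue) == batch_size:
                # Work on the whole batch at once, then hand the results
                # back to the caller one by one.
                process_batch(queue)
                for doc in queue:
                    yield doc
                queue = []
        # The final batch is usually only partially filled.
        process_batch(queue)
        for doc in queue:
            yield doc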

+aside("Iterators") #[a(href="http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/") My favourite post] on the Zen of Python iterators was written by Radim, the creator of #[a(href="http://radimrehurek.com/gensim/") Gensim]. I was on board with generators before, but I hadn't really thought about the simplicity of minibatching.

p Each document is processed independently, so if your batch size is large enough, and OpenMP is enabled, you should be able to work all your cores with only one copy of the spaCy models in memory. spaCy is designed for web-scale data processing — we want you to be able to perform sophisticated linguistic analysis on whole dumps of the Common Crawl. With effective shared memory parallelism, those jobs are many times cheaper.

+table(["Method", "Number of threads", "Seconds (lower is better)"])
    +row
        +cell Loop
        +cell 1
        +cell 691s
    +row
        +cell Pipe
        +cell 1
        +cell 678s
    +row
        +cell Pipe
        +cell 2
        +cell 432s
    +row
        +cell Pipe
        +cell 4
        +cell 312s

caption.
    Seconds to parse 20,000 documents with 1, 2 or 4 threads. The #[em loop]
    condition uses the #[code doc = nlp(text)] call instead of the
    #[code .pipe] method, showing that the overhead of the minibatched
    stream processing is within measurement error.

+h3("implementation") Python, Cython and the Global Interpreter Lock

p Endless ink has been spilled about the CPython #[a(href="https://wiki.python.org/moin/GlobalInterpreterLock") Global Interpreter Lock (GIL)]. It isn't a problem for most code, but for spaCy, it really is. Computers may be fast and getting faster, but the internet is big and getting bigger. We have a lot of text to process, and we'd like to use our machines efficiently.

p Python maintains reference counts in a global data structure. When you create or delete a Python object, its reference count has to change. However, the data structure holding the reference counts is not thread-safe. To change the reference counts, you therefore need to acquire the global interpreter lock.

p One way around the GIL is therefore to avoid the need for Python variables. This is what I've done with spaCy. More specifically, spaCy is a Python library, but it's not actually written in Python. It's implemented in Cython and transpiled to C++, which is compiled into an extension module.

p In ordinary Python code, you can have a list of numbers like this:

+code('my_list.py', 'Python list').
    my_list = [0, 1, 2]

p In Cython, you can write exactly the same code, but the code is not interpreted by Python directly. Instead, it's transpiled into C or C++ code, which calls the Python C-API. Here's some of the resulting code:

+code('my_list.c', 'Transpiled C').
    __pyx_t_1 = PyList_New(3); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 1; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_INCREF(__pyx_int_0);
    __Pyx_GIVEREF(__pyx_int_0);
    PyList_SET_ITEM(__pyx_t_1, 0, __pyx_int_0);
    __Pyx_INCREF(__pyx_int_1);
    __Pyx_GIVEREF(__pyx_int_1);
    PyList_SET_ITEM(__pyx_t_1, 1, __pyx_int_1);
    __Pyx_INCREF(__pyx_int_2);
    __Pyx_GIVEREF(__pyx_int_2);
    PyList_SET_ITEM(__pyx_t_1, 2, __pyx_int_2);

p You can't call any of those functions if you're not holding the GIL. But you can call plain old C and C++ functions, such as #[code malloc()] and #[code free()]:

+code('my_array.pyx', 'C in Cython').
    from libc.stdlib cimport malloc, free

    my_arr = <int*>malloc(sizeof(int) * 3)
    my_arr[0] = 1
    my_arr[1] = 2
    my_arr[2] = 3
    do_stuff(my_arr)
    free(my_arr)

p The Cython #[code nogil] keyword allows you to declare that a function is safe to call even if you're not already holding the GIL. You can read more about releasing the GIL with Cython #[a(href="http://docs.cython.org/src/userguide/parallelism.html") here].
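
p To give a flavour of what this looks like, here's a minimal sketch of a #[code nogil] function, and a Python-visible wrapper that calls it. The #[code sum_array] and #[code sum_list] names are just illustrations, not spaCy APIs.

+code('sum_array.pyx', 'A nogil function').
    from libc.stdlib cimport malloc, free

    cdef int sum_array(int* arr, int n) nogil:
        # No Python objects are created or touched here, so no reference
        # counts change, and the GIL isn't needed.
        cdef int i
        cdef int total = 0
        for i in range(n):
            total += arr[i]
        return total

    def sum_list(numbers):
        # Marshal the Python list into a C array while we still hold the
        # GIL, then release it for the pure-C computation.
        cdef int n = len(numbers)
        cdef int* arr = <int*>malloc(n * sizeof(int))
        cdef int i
        cdef int total
        for i in range(n):
            arr[i] = numbers[i]
        with nogil:
            total = sum_array(arr, n)
        free(arr)
        return total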

p The disadvantages of writing with #[code nogil] semantics are obvious — you're limited to writing #[a(href="https://spacy.io/blog/writing-c-in-cython") C with (arguably) nicer syntax]. If you've never tried it, I think it's an interesting exercise to do without the Python semantics. It does make you appreciate what the language is providing. The things I miss most are probably exceptions and lists. The Python unicode object is also very useful.

+h3('parser-pipe') Implementation of the Parser.pipe method

p Here's the implementation of the #[code Parser.pipe] method in spaCy. This method does the following:

+list
    +item Buffers the texts into temporary work arrays
    +item Releases the GIL
    +item Iterates over the work arrays in an OpenMP #[code prange] loop
    +item Calls the #[code Parser.parseC()] method for each unit of work (each document)

+code('python', 'Parser.pipe').
    def pipe(self, stream, int batch_size=1000, int n_threads=2):
        cdef Pool mem = Pool()
        # Work arrays, holding a C pointer and a length for each document.
        cdef TokenC** doc_ptr = <TokenC**>mem.alloc(batch_size, sizeof(TokenC*))
        cdef int* lengths = <int*>mem.alloc(batch_size, sizeof(int))
        cdef Doc doc
        cdef int i
        cdef int nr_class = self.moves.n_moves
        cdef int nr_feat = self.model.nr_feat
        cdef int status
        queue = []
        for doc in stream:
            doc_ptr[len(queue)] = doc.c
            lengths[len(queue)] = doc.length
            queue.append(doc)
            if len(queue) == batch_size:
                with nogil:
                    for i in cython.parallel.prange(batch_size, num_threads=n_threads):
                        status = self.parseC(doc_ptr[i], lengths[i], nr_feat, nr_class)
                        if status != 0:
                            with gil:
                                sent_str = queue[i].text
                                raise ValueError("Error parsing doc: %s" % sent_str)
                PyErr_CheckSignals()
                for doc in queue:
                    self.moves.finalize_doc(doc)
                    yield doc
                queue = []
        # Parse whatever is left in the final, partially filled batch.
        batch_size = len(queue)
        with nogil:
            for i in cython.parallel.prange(batch_size, num_threads=n_threads):
                status = self.parseC(doc_ptr[i], lengths[i], nr_feat, nr_class)
                if status != 0:
                    with gil:
                        sent_str = queue[i].text
                        raise ValueError("Error parsing doc: %s" % sent_str)
        PyErr_CheckSignals()
        for doc in queue:
            self.moves.finalize_doc(doc)
            yield doc

p The actual mechanics of the multi-threading are super simple, because NLP is (often) embarrassingly parallel — every document is parsed independently, so we just need to make a #[code prange] loop over a stream of texts. The #[code prange] function is an auto-magical work-sharing loop that manages the OpenMP semantics for you. You still need to reason about false sharing, thread safety, etc. — all the parts that make writing multi-threaded code fundamentally challenging. But at least the calling syntax is clean, and a few incidental details are taken care of for you.
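
p Stripped of the parsing logic, the pattern looks something like this. It's a minimal sketch, with #[code do_work] standing in for any #[code nogil] worker function:

+code('work_loop.pyx', 'A minimal prange loop').
    cimport cython.parallel
    from libc.stdlib cimport malloc, free

    cdef void do_work(int* item) nogil:
        item[0] = item[0] * 2

    def process_all(int n, int n_threads=4):
        cdef int* items = <int*>malloc(n * sizeof(int))
        cdef int i
        for i in range(n):
            items[i] = i
        # Each iteration is independent, so OpenMP can share the work
        # across threads. No Python objects are touched inside the loop.
        with nogil:
            for i in cython.parallel.prange(n, num_threads=n_threads):
                do_work(&items[i])
        result = [items[i] for i in range(n)]
        free(items)
        return result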

+h3('the-hard-part') The hard part

p I couldn't tell you that multi-threading the parser was easy. At least, not with a straight face. I've never written a significant Java program, but I imagine writing multi-threaded Java is significantly easier. Using Cython, the task was at least possible. But it definitely wasn't easy.

p If you count my time in academia, I've been writing statistical parsers in Cython for five or six years now, and I've always wanted to release the GIL around the parsing loop. By late 2015 I had the machine learning, hash table, outer parsing loop, and most of the feature extraction as #[code nogil] functions. But the #[a(href="https://github.com/spacy-io/spaCy/blob/0.100.2/spacy/syntax/stateclass.pxd#L10") state object] had a complicated interface, and was implemented as a #[code cdef class]. I couldn't create this object or store it in a container without acquiring the GIL.

p The breakthrough came when I figured out an undocumented way to write a C++ class in Cython. This allowed me to hollow out the existing #[code cdef class] that controlled the parser state. I proxied its interface to the inner C++ class, method by method. This way I could keep the code working, and make sure I didn't introduce any subtle bugs into the feature calculation.
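
p In outline, the trick looks something like this. It's a simplified sketch, not spaCy's actual state class:

+code('state.pyx', 'A C++ class with a cdef class proxy').
    # The inner class is plain C++: it can be created, stored in
    # containers and have its methods called without the GIL.
    cdef cppclass StateC:
        int length

        void push(int word) nogil:
            this.length += 1

    # The outer cdef class keeps the old Python-visible interface,
    # proxying each method to the inner C++ object.
    cdef class StateClass:
        cdef StateC* c

        def __cinit__(self):
            self.c = new StateC()
            self.c.length = 0

        def __dealloc__(self):
            del self.c

        def push(self, int word):
            self.c.push(word)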

p You can see the inner class #[a(href="https://github.com/spacy-io/spaCy/blob/master/spacy/syntax/_state.pxd") here]. If you navigate around the git history of this file, you can see the patches where I implemented the #[code .pipe] method.

+h3('conclusion') Conclusion

p Natural language processing (NLP) programs have some peculiar performance characteristics. The algorithms and data structures involved are often rather complicated, while the numeric calculations being performed are often quite trivial.

p spaCy's parser and named entity recognition system have to make two or three predictions on each word of the input document, using a linear model. All of the complexity is in the feature extraction and state management code, and the computational bottleneck ends up being #[a(href="https://github.com/spacy-io/thinc/blob/master/thinc/linear/avgtron.pyx#L128") retrieval of the weights data] from main memory.

p When we finally switch over to a neural network model, the considerations will be a little bit different. It's possible to implement the parser such that you're stepping forward through multiple sentences at once. However, that has its own challenges. I think it's both easier and more effective to parallelise the outer loop, so I expect the work put into the current implementation will serve us well.