spaCy/website/api/_architecture/_cython.jade

//- 💫 DOCS > API > ARCHITECTURE > CYTHON

+aside("What's Cython?")
    |  #[+a("http://cython.org/") Cython] is a language for writing
    |  C extensions for Python. Most Python code is also valid Cython, but
    |  you can add type declarations to get efficient code with manually
    |  managed memory, just like in C or C++.

p
    |  spaCy's core data structures are implemented as
    |  #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
    |  managed through the #[code Pool] class from
    |  #[+a(gh("cymem")) #[code cymem]], which lets you allocate memory
    |  that is freed when the #[code Pool] object is garbage collected.
    |  This means you usually don't have to worry about freeing memory.
    |  You just have to decide which Python object owns the memory, and
    |  make it own the #[code Pool]. When that object is garbage
    |  collected, the memory is freed along with it. You do have to take
    |  care that no pointers outlive the object that owns them, but this
    |  is generally quite easy.
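
p
    |  As a minimal sketch of this ownership pattern, the hypothetical
    |  #[code IntBuffer] class below (an illustration, not part of spaCy)
    |  owns a #[code Pool] and allocates an array of C integers from it.
    |  It assumes #[code cymem] is installed, so that #[code Pool] can be
    |  cimported. There is no explicit call to free the array: it lives
    |  exactly as long as the #[code IntBuffer] that owns the
    |  #[code Pool].

+code.
    from cymem.cymem cimport Pool

    cdef class IntBuffer:
        """Hypothetical example class that owns its memory via a Pool."""
        cdef Pool mem
        cdef int* data
        cdef int length

        def __init__(self, int length):
            self.mem = Pool()
            # The Pool records this address and frees it when the Pool
            # itself is garbage collected, i.e. together with this object.
            self.data = &lt;int*&gt;self.mem.alloc(length, sizeof(int))
            self.length = length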

p
    |  All Cython modules should have the #[code # cython: infer_types=True]
    |  compiler directive at the top of the file. This makes the code much
    |  cleaner, as it avoids the need for many type declarations. If
    |  possible, you should prefer to declare your functions #[code nogil],
    |  even if you don't especially care about multi-threading. The reason
    |  is that #[code nogil] functions help the Cython compiler reason about
    |  your code quite a lot: you're telling the compiler that no Python
    |  dynamics are possible. This lets the compiler catch many errors at
    |  compile time, and ensures your function will run at C speed.
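
p
    |  A small sketch of how the two recommendations combine, using a
    |  hypothetical #[code count_positive] function that is not part of
    |  spaCy. No local variable is declared, yet the function compiles as
    |  #[code nogil], because the compiler can infer a C type for each of
    |  them:

+code.
    # cython: infer_types=True

    cdef int count_positive(const int* values, int length) nogil:
        count = 0                  # inferred as a C integer
        for i in range(length):    # inferred too, so this is a plain C loop
            if values[i] > 0:
                count += 1
        # Creating a Python object here, e.g. "seen = [count]", needs the
        # GIL, so the compiler should reject it instead of silently
        # falling back to slow Python operations.
        return count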


p
    |  Cython gives you many choices of sequences: you could have a Python
    |  list, a numpy array, a memory view, a C++ vector, or a pointer.
    |  Pointers are preferred, because they are fastest, have the most
    |  explicit semantics, and let the compiler check your code more
    |  strictly. C++ vectors are also great — but you should only use them
    |  internally in functions. It's less friendly to accept a vector as an
    |  argument, because that asks the user to do much more work. Here's
    |  how to get a pointer from a numpy array, memory view or vector:

+code.
    cimport numpy as np
    from libcpp.vector cimport vector

    cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
        pointer1 = &lt;int*&gt;numpy_array.data
        pointer2 = cpp_vector.data()
        pointer3 = &memory_view[0]

p
    |  Both C arrays and C++ vectors reassure the compiler that no Python
    |  operations are possible on your variable. This is a big advantage:
    |  it lets the Cython compiler raise many more errors for you.

p
    |  When getting a pointer from a numpy array or memoryview, take care
    |  that the data is actually stored in C-contiguous order — otherwise
    |  you'll get a pointer to nonsense. The type declarations in the code
    |  above should generate runtime errors if buffers with incorrect
    |  memory layouts are passed in. To iterate over the array, the
    |  following style is preferred:

+code.
    cdef int c_total(const int* int_array, int length) nogil:
        total = 0
        for item in int_array[:length]:
            total += item
        return total

p
    |  If this is confusing, consider that the compiler couldn't deal with
    |  #[code for item in int_array:] — there's no length attached to a raw
    |  pointer, so how could we figure out where to stop? The length is
    |  provided in the slice notation as a solution to this. Note that we
    |  don't have to declare the type of #[code item] in the code above —
    |  the compiler can easily infer it. This gives us tidy code that looks
    |  quite like Python, but is exactly as fast as C — because we've made
    |  sure the compilation to C is trivial.
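
p
    |  To make a function like #[code c_total] callable from Python, one
    |  option is a thin #[code def] wrapper that accepts a typed memory
    |  view and passes the pointer and length along. The
    |  #[code py_total] name below is hypothetical, and the sketch
    |  assumes the caller passes a C-contiguous buffer of C integers,
    |  such as a numpy array with dtype #[code intc]:

+code.
    # cython: infer_types=True

    cdef int c_total(const int* int_array, int length) nogil:
        total = 0
        for item in int_array[:length]:
            total += item
        return total

    def py_total(int[::1] values):
        """Hypothetical wrapper: sum a C-contiguous buffer of ints."""
        # &values[0] is out of bounds for an empty buffer, so handle it first.
        if values.shape[0] == 0:
            return 0
        return c_total(&values[0], &lt;int&gt;values.shape[0])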

p
    |  Your functions cannot be declared #[code nogil] if they need to
    |  create Python objects or call Python functions. This is perfectly
    |  okay — you shouldn't torture your code just to get #[code nogil]
    |  functions. However, if your function isn't #[code nogil], you should
    |  compile your module with #[code cython -a --cplus my_module.pyx] and
    |  open the resulting #[code my_module.html] file in a browser. This
    |  will let you see how Cython is compiling your code. Calls into the
    |  Python run-time will be in bright yellow. This lets you easily see
    |  whether Cython is able to correctly type your code, or whether there
    |  are unexpected problems.

p
    |  Working in Cython is very rewarding once you're over the initial
    |  learning curve. As with C and C++, the first way you write something
    |  in Cython will often be the performance-optimal approach. In
    |  contrast, Python optimisation generally requires a lot of
    |  experimentation. Is it faster to have an #[code if item in my_dict]
    |  check, or to use #[code .get()]? What about
    |  #[code try]/#[code except]? Does this numpy operation create a copy?
    |  There's no way to guess the answers to these questions, and you'll
    |  usually be dissatisfied with your results — so there's no way to
    |  know when to stop this process. In the worst case, you'll make a
    |  mess that invites the next reader to try their luck too. This is
    |  like one of those
    |  #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
    |  where the rescuers keep passing out from low oxygen, causing
    |  another rescuer to follow — only to succumb themselves. In short,
    |  just say no to optimising your Python. If it's not fast enough the
    |  first time, just switch to Cython.

+infobox("Resources")
    +list.o-no-block
        +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
        +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
        +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)