//- 💫 DOCS > API > CYTHON > ARCHITECTURE

include ../_includes/_mixins

+section("overview")
    +aside("What's Cython?")
        | #[+a("http://cython.org/") Cython] is a language for writing
        | C extensions for Python. Most Python code is also valid Cython, but
        | you can add type declarations to get efficient memory-managed code
        | just like C or C++.

    p
        | This section documents spaCy's C-level data structures and
        | interfaces, intended for use from Cython. Some of the attributes are
        | primarily for internal use, and all C-level functions and methods are
        | designed for speed over safety – if you make a mistake and access an
        | array out-of-bounds, the program may crash abruptly.

    p
        | With Cython there are four ways of declaring complex data types.
        | Unfortunately we use all four in different places, as they all have
        | different utility:

    +table(["Declaration", "Description", "Example"])
        +row
            +cell #[code class]
            +cell A normal Python class.
            +cell #[+api("language") #[code Language]]

        +row
            +cell #[code cdef class]
            +cell
                | A Python extension type. Differs from a normal Python class
                | in that its attributes can be defined on the underlying
                | struct. Can have C-level objects as attributes (notably
                | structs and pointers), and can have methods which have
                | C-level objects as arguments or return types.
            +cell #[+api("cython-classes#lexeme") #[code Lexeme]]

        +row
            +cell #[code cdef struct]
            +cell
                | A struct is just a collection of variables, sort of like a
                | named tuple, except the memory is contiguous. Structs can't
                | have methods, only attributes.
            +cell #[+api("cython-structs#lexemec") #[code LexemeC]]

        +row
            +cell #[code cdef cppclass]
            +cell
                | A C++ class. Like a struct, this can be allocated on the
                | stack, but can have methods, a constructor and a destructor.
                | Differs from #[code cdef class] in that it can be created and
                | destroyed without acquiring the Python global interpreter
                | lock. This style is the most obscure.
            +cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]

    p
        | The most important classes in spaCy are defined as #[code cdef class]
        | objects. The underlying data for these objects is usually gathered
        | into a struct, typically stored in an attribute named #[code c]. For
        | instance, the #[+api("cython-classes#lexeme") #[code Lexeme]] class
        | holds a #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
        | #[code Lexeme.c]. This lets you shed the Python container, and pass
        | a pointer to the underlying data into C-level functions.

+section("conventions")
    +h(2, "conventions") Conventions

    p
        | spaCy's core data structures are implemented as
        | #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
        | managed through the #[code cymem.Pool] class from
        | #[+a(gh("cymem")) #[code cymem]], which allows you
        | to allocate memory which will be freed when the #[code Pool] object
        | is garbage collected. This means you usually don't have to worry
        | about freeing memory. You just have to decide which Python object
        | owns the memory, and make it own the #[code Pool]. When that object
        | goes out of scope, the memory will be freed. You do have to take
        | care that no pointers outlive the object that owns them, but this
        | is generally quite easy.

    p
        | All Cython modules should have the #[code # cython: infer_types=True]
        | compiler directive at the top of the file. This makes the code much
        | cleaner, as it avoids the need for many type declarations. If
        | possible, you should prefer to declare your functions #[code nogil],
        | even if you don't especially care about multi-threading. The reason
        | is that #[code nogil] functions help the Cython compiler reason about
        | your code quite a lot: you're telling the compiler that no Python
        | dynamics are possible. This lets the compiler raise many more errors,
        | and ensures your function will run at C speed.

    p
        | Cython gives you many choices of sequences: you could have a Python
        | list, a numpy array, a memory view, a C++ vector, or a pointer.
        | Pointers are preferred, because they are fastest, have the most
        | explicit semantics, and let the compiler check your code more
        | strictly. C++ vectors are also great, but you should only use them
        | internally in functions. It's less friendly to accept a vector as an
        | argument, because that asks the user to do much more work. Here's
        | how to get a pointer from a numpy array, memory view or vector:

    +code.
        cimport numpy as np
        from libcpp.vector cimport vector

        cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
            pointer1 = <int*>numpy_array.data
            pointer2 = cpp_vector.data()
            pointer3 = &memory_view[0]

    p
        | Both C arrays and C++ vectors reassure the compiler that no Python
        | operations are possible on your variable. This is a big advantage:
        | it lets the Cython compiler raise many more errors for you.

    p
        | When getting a pointer from a numpy array or memory view, take care
        | that the data is actually stored in C-contiguous order. Otherwise
        | you'll get a pointer to nonsense. The type declarations in the code
        | above should generate runtime errors if buffers with incorrect
        | memory layouts are passed in. To iterate over the array, the
        | following style is preferred:

    +code.
        cdef int c_total(const int* int_array, int length) nogil:
            total = 0
            for item in int_array[:length]:
                total += item
            return total

    p
        | If this is confusing, consider that the compiler couldn't deal with
        | #[code for item in int_array:], because there's no length attached
        | to a raw pointer, so how could we figure out where to stop? The
        | length is provided in the slice notation as a solution to this. Note
        | that we don't have to declare the type of #[code item] in the code
        | above, since the compiler can easily infer it. This gives us tidy
        | code that looks quite like Python, but is exactly as fast as C,
        | because we've made sure the compilation to C is trivial.

    p
        | Your functions cannot be declared #[code nogil] if they need to
        | create Python objects or call Python functions.
        | This is perfectly okay. You shouldn't torture your code just to get
        | #[code nogil] functions. However, if your function isn't
        | #[code nogil], you should compile your module with
        | #[code cython -a --cplus my_module.pyx] and
        | open the resulting #[code my_module.html] file in a browser. This
        | will let you see how Cython is compiling your code. Calls into the
        | Python run-time will be in bright yellow. This lets you easily see
        | whether Cython is able to correctly type your code, or whether there
        | are unexpected problems.

    p
        | Working in Cython is very rewarding once you're over the initial
        | learning curve. As with C and C++, the first way you write something
        | in Cython will often be the performance-optimal approach. In
        | contrast, Python optimisation generally requires a lot of
        | experimentation. Is it faster to have an #[code if item in my_dict]
        | check, or to use #[code .get()]? What about
        | #[code try]/#[code except]? Does this numpy operation create a copy?
        | There's no way to guess the answers to these questions, and you'll
        | usually be dissatisfied with your results, so there's no way to
        | know when to stop this process. In the worst case, you'll make a
        | mess that invites the next reader to try their luck too. This is
        | like one of those
        | #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
        | where the rescuers keep passing out from low oxygen, causing
        | another rescuer to follow, only to succumb themselves. In short,
        | just say no to optimizing your Python. If it's not fast enough the
        | first time, just switch to Cython.

    +infobox("Resources")
        +list.o-no-block
            +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
            +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
            +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
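
    p
        | The C-contiguity caveat from the pointer discussion above can be
        | explored without compiling anything. The following is only a rough
        | sketch in plain Python, using the standard-library
        | #[code memoryview] rather than a Cython typed memory view; the
        | helper name #[code is_c_contiguous] is hypothetical and not part of
        | spaCy or Cython. It shows that a strided slice of a buffer is no
        | longer C-contiguous, which is the situation where a raw pointer
        | plus a length would walk over elements the view was meant to skip.

```python
from array import array


def is_c_contiguous(buffer):
    """Return True if the buffer's items sit back to back in memory.

    Hypothetical helper for illustration only.
    """
    return memoryview(buffer).c_contiguous


# Six C ints stored in one contiguous block.
values = array("i", [10, 20, 30, 40, 50, 60])

whole = memoryview(values)  # every item, stride of one item
strided = whole[::2]        # every second item, stride of two items

print(is_c_contiguous(whole))    # True
print(is_c_contiguous(strided))  # False
print(strided.tolist())          # [10, 30, 50]

# The stride, measured in items, shows the gap a raw pointer would ignore.
print(strided.strides[0] // strided.itemsize)  # 2
```

    p
        | A check along these lines is roughly what a declaration such as
        | #[code int[::1]] asks Cython to enforce when a buffer is passed in.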