mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 10:26:35 +03:00
177 lines
9.1 KiB
Plaintext
//- 💫 DOCS > API > CYTHON > ARCHITECTURE

include ../_includes/_mixins
+section("overview")
    +aside("What's Cython?")
        | #[+a("http://cython.org/") Cython] is a language for writing
        | C extensions for Python. Most Python code is also valid Cython, but
        | you can add type declarations to get efficient memory-managed code
        | just like C or C++.
    p
        | This section documents spaCy's C-level data structures and
        | interfaces, intended for use from Cython. Some of the attributes are
        | primarily for internal use, and all C-level functions and methods are
        | designed for speed over safety – if you make a mistake and access an
        | array out-of-bounds, the program may crash abruptly.
    p
        | Cython offers four ways of declaring complex data types.
        | Unfortunately, we use all four in different places, as each serves a
        | different purpose:
    +table(["Declaration", "Description", "Example"])
        +row
            +cell #[code class]
            +cell A normal Python class.
            +cell #[+api("language") #[code Language]]

        +row
            +cell #[code cdef class]
            +cell
                | A Python extension type. Differs from a normal Python class
                | in that its attributes can be defined on the underlying
                | struct. Can have C-level objects as attributes (notably
                | structs and pointers), and can have methods which have
                | C-level objects as arguments or return types.
            +cell #[+api("cython-classes#lexeme") #[code Lexeme]]

        +row
            +cell #[code cdef struct]
            +cell
                | A struct is just a collection of variables, sort of like a
                | named tuple, except the memory is contiguous. Structs can't
                | have methods, only attributes.
            +cell #[+api("cython-structs#lexemec") #[code LexemeC]]

        +row
            +cell #[code cdef cppclass]
            +cell
                | A C++ class. Like a struct, this can be allocated on the
                | stack, but can have methods, a constructor and a destructor.
                | Differs from #[code cdef class] in that it can be created and
                | destroyed without acquiring the Python global interpreter
                | lock. This style is the most obscure.
            +cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]
    p
        | The most important classes in spaCy are defined as #[code cdef class]
        | objects. The underlying data for these objects is usually gathered
        | into a struct, which is usually named #[code c]. For instance, the
        | #[+api("cython-classes#lexeme") #[code Lexeme]] class holds a
        | #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
        | #[code Lexeme.c]. This lets you shed the Python container, and pass
        | a pointer to the underlying data into C-level functions.
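    p
        | As a rough illustration of this pattern, the sketch below pairs a
        | struct with an extension type that exposes it as #[code c]. The
        | names #[code PointC], #[code Point] and #[code c_norm_sq] are
        | invented for this example and are not part of spaCy's API. The
        | struct is allocated with plain #[code calloc] here only to keep the
        | sketch free of other dependencies.

```cython
# cython: infer_types=True
# Hypothetical example: PointC, Point and c_norm_sq are not spaCy names.
from libc.stdlib cimport calloc, free


cdef struct PointC:
    int x
    int y


cdef int c_norm_sq(const PointC* point) nogil:
    # Works on the raw struct only, so no Python objects are involved.
    return point.x * point.x + point.y * point.y


cdef class Point:
    # The underlying data lives in a struct, reachable as `self.c`.
    cdef PointC* c

    def __cinit__(self, int x, int y):
        self.c = <PointC*>calloc(1, sizeof(PointC))
        if self.c == NULL:
            raise MemoryError()
        self.c.x = x
        self.c.y = y

    def __dealloc__(self):
        free(self.c)

    def norm_sq(self):
        # Shed the Python container and pass the struct pointer along.
        return c_norm_sq(self.c)
```

    p
        | Python callers only ever see #[code Point], while the C-level
        | function receives the struct pointer and never touches a Python
        | object.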
+section("conventions")
    +h(2, "conventions") Conventions

    p
        | spaCy's core data structures are implemented as
        | #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
        | managed through the #[+a(gh("cymem")) #[code cymem]]
        | #[code cymem.Pool] class, which allows you
        | to allocate memory which will be freed when the #[code Pool] object
        | is garbage collected. This means you usually don't have to worry
        | about freeing memory. You just have to decide which Python object
        | owns the memory, and make it own the #[code Pool]. When that object
        | goes out of scope, the memory will be freed. You do have to take
        | care that no pointers outlive the object that owns them, but this
        | is generally quite easy.
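    p
        | A minimal sketch of that ownership rule, assuming #[code cymem] is
        | installed so that #[code Pool] can be cimported. The
        | #[code IntBuffer] class is hypothetical and exists only for this
        | example: it owns a #[code Pool], and the array it allocates lives
        | exactly as long as the object does.

```cython
# cython: infer_types=True
# Hypothetical example: IntBuffer is not part of spaCy or cymem.
from cymem.cymem cimport Pool


cdef class IntBuffer:
    cdef Pool mem      # Owns every allocation made through it.
    cdef int* data     # Must not be used after this object is gone.
    cdef int length

    def __init__(self, int length):
        if length <= 0:
            raise ValueError("length must be positive")
        self.mem = Pool()
        # Pool.alloc returns zero-initialised memory, much like calloc.
        self.data = <int*>self.mem.alloc(length, sizeof(int))
        self.length = length

    def __getitem__(self, int i):
        # Bounds are checked here because the raw pointer has none.
        if i < 0 or i >= self.length:
            raise IndexError(i)
        return self.data[i]
```

    p
        | There is no #[code __dealloc__] method: once the
        | #[code IntBuffer] is collected, its #[code Pool] goes with it and
        | the memory is released.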
    p
        | All Cython modules should have the #[code # cython: infer_types=True]
        | compiler directive at the top of the file. This makes the code much
        | cleaner, as it avoids the need for many type declarations. If
        | possible, you should prefer to declare your functions #[code nogil],
        | even if you don't especially care about multi-threading. The reason
        | is that #[code nogil] functions help the Cython compiler reason about
        | your code quite a lot: you're telling the compiler that no Python
        | dynamics are possible. This lets the compiler raise many more errors
        | for you, and ensures your function will run at C speed.
    p
        | Cython gives you many choices of sequences: you could have a Python
        | list, a numpy array, a memory view, a C++ vector, or a pointer.
        | Pointers are preferred, because they are fastest, have the most
        | explicit semantics, and let the compiler check your code more
        | strictly. C++ vectors are also great, but you should only use them
        | internally in functions. It's less friendly to accept a vector as an
        | argument, because that asks the user to do much more work. Here's
        | how to get a pointer from a numpy array, memory view or vector:
    +code.
        cimport numpy as np
        from libcpp.vector cimport vector

        cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
            pointer1 = <int*>numpy_array.data
            pointer2 = cpp_vector.data()
            pointer3 = &memory_view[0]
    p
        | Both C arrays and C++ vectors reassure the compiler that no Python
        | operations are possible on your variable. This is a big advantage:
        | it lets the Cython compiler raise many more errors for you.

    p
        | When getting a pointer from a numpy array or memoryview, take care
        | that the data is actually stored in C-contiguous order. Otherwise
        | you'll get a pointer to nonsense. The type declarations in the code
        | above should generate runtime errors if buffers with incorrect
        | memory layouts are passed in. To iterate over the array, the
        | following style is preferred:
    +code.
        cdef int c_total(const int* int_array, int length) nogil:
            total = 0
            for item in int_array[:length]:
                total += item
            return total
    p
        | If this is confusing, consider that the compiler couldn't deal with
        | #[code for item in int_array:], because there's no length attached to
        | a raw pointer, so how could we figure out where to stop? The length is
        | provided in the slice notation as a solution to this. Note that we
        | don't have to declare the type of #[code item] in the code above,
        | because the compiler can easily infer it. This gives us tidy code that
        | looks quite like Python, but is exactly as fast as C, because we've
        | made sure the compilation to C is trivial.
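    p
        | To show how such a function might be reached from Python, here is a
        | hedged sketch of a thin wrapper. #[code c_total] is repeated from
        | the example above so the module is self-contained. The
        | #[code total] wrapper is hypothetical and not part of spaCy. It
        | accepts any C-contiguous buffer of C ints, takes a pointer to the
        | first element and passes the length alongside it.

```cython
# cython: infer_types=True
cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total


def total(int[::1] values):
    # Hypothetical Python-facing wrapper around c_total.
    if values.shape[0] == 0:
        # &values[0] would be out of bounds for an empty buffer.
        return 0
    # The pointer carries no length, so the length is passed explicitly.
    return c_total(&values[0], <int>values.shape[0])
```

    p
        | Keeping the wrapper this small leaves all of the real work in the
        | #[code nogil] function, where the compiler can check it strictly.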
    p
        | Your functions cannot be declared #[code nogil] if they need to
        | create Python objects or call Python functions. This is perfectly
        | okay: you shouldn't torture your code just to get #[code nogil]
        | functions. However, if your function isn't #[code nogil], you should
        | compile your module with #[code cython -a --cplus my_module.pyx] and
        | open the resulting #[code my_module.html] file in a browser. This
        | will let you see how Cython is compiling your code. Calls into the
        | Python run-time will be in bright yellow. This lets you easily see
        | whether Cython is able to correctly type your code, or whether there
        | are unexpected problems.
    p
        | Working in Cython is very rewarding once you're over the initial
        | learning curve. As with C and C++, the first way you write something
        | in Cython will often be the performance-optimal approach. In
        | contrast, Python optimisation generally requires a lot of
        | experimentation. Is it faster to have an #[code if item in my_dict]
        | check, or to use #[code .get()]? What about
        | #[code try]/#[code except]? Does this numpy operation create a copy?
        | There's no way to guess the answers to these questions, and you'll
        | usually be dissatisfied with your results, so there's no way to
        | know when to stop this process. In the worst case, you'll make a
        | mess that invites the next reader to try their luck too. This is
        | like one of those
        | #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
        | where the rescuers keep passing out from low oxygen, causing
        | another rescuer to follow, only to succumb themselves. In short,
        | just say no to optimizing your Python. If it's not fast enough the
        | first time, just switch to Cython.
    +infobox("Resources")
        +list.o-no-block
            +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
            +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
            +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)