mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 21:21:10 +03:00 
			
		
		
		
## Description

This PR adds the most relevant documentation of spaCy's Cython API. (Todo for when we publish this: rewrite `/api/#section-cython` and `/api/#cython` to `/api/cython#conventions`.)

### Types of change

docs

## Checklist

<!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
		
			
				
	
	
		
			177 lines
		
	
	
		
			9.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| //- 💫 DOCS > API > CYTHON > ARCHITECTURE
 | ||
| 
 | ||
| include ../_includes/_mixins
 | ||
| 
 | ||
| +section("overview")
 | ||
|     +aside("What's Cython?")
 | ||
|         |  #[+a("http://cython.org/") Cython] is a language for writing
 | ||
|         |  C extensions for Python. Most Python code is also valid Cython, but
 | ||
|         |  you can add type declarations to get efficient memory-managed code
 | ||
|         |  just like C or C++.
 | ||
| 
 | ||
|     p
 | ||
|         |  This section documents spaCy's C-level data structures and
 | ||
|         |  interfaces, intended for use from Cython. Some of the attributes are
 | ||
|         |  primarily for internal use, and all C-level functions and methods are
 | ||
|         |  designed for speed over safety – if you make a mistake and access an
 | ||
|         |  array out-of-bounds, the program may crash abruptly.
 | ||
| 
 | ||
|     p
 | ||
|         |  With Cython there are four ways of declaring complex data types.
 | ||
|         |  Unfortunately we use all four in different places, as they all have
 | ||
|         |  different utility:
 | ||
| 
 | ||
|     +table(["Declaration", "Description", "Example"])
 | ||
|         +row
 | ||
|             +cell #[code class]
 | ||
|             +cell A normal Python class.
 | ||
|             +cell #[+api("language") #[code Language]]
 | ||
| 
 | ||
|         +row
 | ||
|             +cell #[code cdef class]
 | ||
|             +cell
 | ||
|                 |  A Python extension type. Differs from a normal Python class
 | ||
|                 |  in that its attributes can be defined on the underlying
 | ||
|                 |  struct. Can have C-level objects as attributes (notably
 | ||
|                 |  structs and pointers), and can have methods which have
 | ||
|                 |  C-level objects as arguments or return types.
 | ||
|             +cell #[+api("cython-classes#lexeme") #[code Lexeme]]
 | ||
| 
 | ||
|         +row
 | ||
|             +cell #[code cdef struct]
 | ||
|             +cell
 | ||
|                 |  A struct is just a collection of variables, sort of like a
 | ||
|                 |  named tuple, except the memory is contiguous. Structs can't
 | ||
|                 |  have methods, only attributes.
 | ||
|             +cell #[+api("cython-structs#lexemec") #[code LexemeC]]
 | ||
| 
 | ||
|         +row
 | ||
|             +cell #[code cdef cppclass]
 | ||
|             +cell
 | ||
|                 |  A C++ class. Like a struct, this can be allocated on the
 | ||
|                 |  stack, but can have methods, a constructor and a destructor.
 | ||
|                 |  Differs from #[code cdef class] in that it can be created and
 | ||
|                 |  destroyed without acquiring the Python global interpreter
 | ||
|                 |  lock. This style is the most obscure.
 | ||
|             +cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]
 | ||
| 
 | ||
|     p
 | ||
|         |  The most important classes in spaCy are defined as #[code cdef class]
 | ||
|         |  objects. The underlying data for these objects is usually gathered
 | ||
|         |  into a struct, which is usually named #[code c]. For instance, the
 | ||
|         |  #[+api("cython-classes#lexeme") #[code Lexeme]] class holds a
 | ||
|         |  #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
 | ||
|         |  #[code Lexeme.c]. This lets you shed the Python container, and pass
 | ||
|         |  a pointer to the underlying data into C-level functions.
 | ||
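As a rough plain-Python analogy of this wrapper-plus-struct layout, the sketch below uses the standard library's ctypes module. The names ToyLexemeC, ToyLexeme and double_length are hypothetical and not part of spaCy. In real Cython code the struct would be a cdef struct, and the function a cdef function taking a pointer.

```python
import ctypes


class ToyLexemeC(ctypes.Structure):
    """Hypothetical stand-in for a C struct: plain fields, no methods."""
    _fields_ = [("orth", ctypes.c_uint64), ("length", ctypes.c_int)]


class ToyLexeme:
    """Hypothetical Python container that keeps its data in a struct at `.c`."""

    def __init__(self, orth, length):
        self.c = ToyLexemeC(orth, length)

    @property
    def length(self):
        return self.c.length


def double_length(lex_ptr):
    """Works on a pointer to the struct only; it never sees the container."""
    lex_ptr.contents.length *= 2


lexeme = ToyLexeme(orth=42, length=5)
double_length(ctypes.pointer(lexeme.c))  # pass the data, not the wrapper
print(lexeme.length)
```

The function mutates the struct through the pointer, and the container sees the change because both refer to the same memory.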
| 
 | ||
| +section("conventions")
 | ||
|     +h(2, "conventions") Conventions
 | ||
| 
 | ||
|     p
 | ||
|         |  spaCy's core data structures are implemented as
 | ||
|         |  #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
 | ||
|         |  managed through the #[+a(gh("cymem")) #[code cymem]]
 | ||
|         |  #[code cymem.Pool] class, which allows you
 | ||
|         |  to allocate memory which will be freed when the #[code Pool] object
 | ||
|         |  is garbage collected. This means you usually don't have to worry
 | ||
|         |  about freeing memory. You just have to decide which Python object
 | ||
|         |  owns the memory, and make it own the #[code Pool]. When that object
 | ||
|         |  goes out of scope, the memory will be freed. You do have to take
 | ||
|         |  care that no pointers outlive the object that owns them — but this
 | ||
|         |  is generally quite easy.
 | ||
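The ownership rule can be modelled in plain Python. This is only a sketch: ToyPool and ToyTokens are hypothetical stand-ins and do not use cymem itself. It assumes CPython, where an object is released as soon as its last reference goes away.

```python
import ctypes
import gc
import weakref


class ToyPool:
    """Hypothetical stand-in for cymem.Pool: it owns every buffer it hands out."""

    def __init__(self):
        self._buffers = []

    def alloc(self, number, ctype):
        buffer = (ctype * number)()   # ctypes zero-initialises new arrays
        self._buffers.append(buffer)  # the pool keeps the buffer alive
        return buffer


class ToyTokens:
    """Hypothetical owner object: it holds the pool, so it owns the memory."""

    def __init__(self, length):
        self.mem = ToyPool()
        self.length = length
        self.data = self.mem.alloc(length, ctypes.c_int)


released = []
tokens = ToyTokens(4)
# Record the moment the pool itself is garbage collected.
weakref.finalize(tokens.mem, released.append, "pool freed")
tokens.data[0] = 7

del tokens    # the owner goes away, and its pool goes with it
gc.collect()
print(released)
```

Unlike this Python model, a raw C pointer does not keep the memory alive, which is why no pointer may outlive the object that owns the pool.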
| 
 | ||
|     p
 | ||
|         |  All Cython modules should have the #[code # cython: infer_types=True]
 | ||
|         |  compiler directive at the top of the file. This makes the code much
 | ||
|         |  cleaner, as it avoids the need for many type declarations. If
 | ||
|         |  possible, you should prefer to declare your functions #[code nogil],
 | ||
|         |  even if you don't especially care about multi-threading. The reason
 | ||
|         |  is that #[code nogil] functions help the Cython compiler reason about
 | ||
|         |  your code quite a lot — you're telling the compiler that no Python
 | ||
|         |  dynamics are possible. This lets the compiler raise many errors at
 | ||
|         |  compile time, and ensures your function will run at C speed.
 | ||
| 
 | ||
| 
 | ||
|     p
 | ||
|         |  Cython gives you many choices of sequences: you could have a Python
 | ||
|         |  list, a numpy array, a memory view, a C++ vector, or a pointer.
 | ||
|         |  Pointers are preferred, because they are fastest, have the most
 | ||
|         |  explicit semantics, and let the compiler check your code more
 | ||
|         |  strictly. C++ vectors are also great — but you should only use them
 | ||
|         |  internally in functions. It's less friendly to accept a vector as an
 | ||
|         |  argument, because that asks the user to do much more work. Here's
 | ||
|         |  how to get a pointer from a numpy array, memory view or vector:
 | ||
| 
 | ||
|     +code.
 | ||
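|         cimport numpy as np
 | ||
|         from libcpp.vector cimport vector
 | ||
| 
 | ||
|         cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
 | ||
|             # Each pointer refers to the first element of its container.
 | ||
|             pointer1 = <int*>numpy_array.data
 | ||
|             pointer2 = cpp_vector.data()
 | ||
|             pointer3 = &memory_view[0]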
 | ||
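The mode='c' and [::1] declarations in the signature above ask for C-contiguous buffers. As a minimal plain-Python sketch of what that property means, the standard library's memoryview reports it directly. Only stdlib calls are used here, and nothing is compiled with Cython.

```python
from array import array

values = array("i", range(6))

whole = memoryview(values)          # one unbroken run of ints
strided = memoryview(values)[::2]   # every second int, with gaps in between

print(whole.c_contiguous, strided.c_contiguous)

# Copying the strided view packs its items into a fresh contiguous buffer.
packed = memoryview(array("i", strided.tolist()))
print(packed.c_contiguous, packed.tolist())
```

A pointer to the first element of the strided view would step through the skipped items as well, which is why only the contiguous layout is safe to hand to pointer-based code.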
| 
 | ||
|     p
 | ||
|         |  Both C arrays and C++ vectors reassure the compiler that no Python
 | ||
|         |  operations are possible on your variable. This is a big advantage:
 | ||
|         |  it lets the Cython compiler raise many more errors for you.
 | ||
| 
 | ||
|     p
 | ||
|         |  When getting a pointer from a numpy array or memoryview, take care
 | ||
|         |  that the data is actually stored in C-contiguous order — otherwise
 | ||
|         |  you'll get a pointer to nonsense. The type-declarations in the code
 | ||
|         |  above should generate runtime errors if buffers with incorrect
 | ||
|         |  memory layouts are passed in. To iterate over the array, the
 | ||
|         |  following style is preferred:
 | ||
| 
 | ||
|     +code.
 | ||
|         cdef int c_total(const int* int_array, int length) nogil:
 | ||
|             total = 0
 | ||
|             for item in int_array[:length]:
 | ||
|                 total += item
 | ||
|             return total
 | ||
| 
 | ||
|     p
 | ||
|         |  If this is confusing, consider that the compiler couldn't deal with
 | ||
|         |  #[code for item in int_array:] — there's no length attached to a raw
 | ||
|         |  pointer, so how could we figure out where to stop? The length is
 | ||
|         |  provided in the slice notation as a solution to this. Note that we
 | ||
|         |  don't have to declare the type of #[code item] in the code above —
 | ||
|         |  the compiler can easily infer it. This gives us tidy code that looks
 | ||
|         |  quite like Python, but is exactly as fast as C — because we've made
 | ||
|         |  sure the compilation to C is trivial.
 | ||
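The same constraint can be tried from plain Python with the standard library's ctypes module, whose pointers also carry no length. This is an illustrative analogue rather than Cython code, and py_total is a hypothetical name.

```python
import ctypes

numbers = (ctypes.c_int * 4)(1, 2, 3, 4)
# Casting to a pointer drops the array's length, as with a raw C pointer.
int_pointer = ctypes.cast(numbers, ctypes.POINTER(ctypes.c_int))


def py_total(pointer, length):
    """Sum `length` ints, with the caller supplying the stopping point."""
    total = 0
    for item in pointer[:length]:
        total += item
    return total


print(py_total(int_pointer, 4))

try:
    int_pointer[:]  # no stop given, so ctypes cannot know where to end
except ValueError as error:
    print("open-ended slice rejected:", error)
```

Here the slice is copied into a Python list before the loop runs. In the Cython version the slice is only a loop bound, and no copy is made.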
| 
 | ||
|     p
 | ||
|         |  Your functions cannot be declared #[code nogil] if they need to
 | ||
|         |  create Python objects or call Python functions. This is perfectly
 | ||
|         |  okay — you shouldn't torture your code just to get #[code nogil]
 | ||
|         |  functions. However, if your function isn't #[code nogil], you should
 | ||
|         |  compile your module with #[code cython -a --cplus my_module.pyx] and
 | ||
|         |  open the resulting #[code my_module.html] file in a browser. This
 | ||
|         |  will let you see how Cython is compiling your code. Calls into the
 | ||
|         |  Python run-time will be in bright yellow. This lets you easily see
 | ||
|         |  whether Cython is able to correctly type your code, or whether there
 | ||
|         |  are unexpected problems.
 | ||
| 
 | ||
|     p
 | ||
|         |  Working in Cython is very rewarding once you're over the initial
 | ||
|         |  learning curve. As with C and C++, the first way you write something
 | ||
|         |  in Cython will often be the performance-optimal approach. In
 | ||
|         |  contrast, Python optimisation generally requires a lot of
 | ||
|         |  experimentation. Is it faster to have an #[code if item in my_dict]
 | ||
|         |  check, or to use #[code .get()]? What about
 | ||
|         |  #[code try]/#[code except]? Does this numpy operation create a copy?
 | ||
|         |  There's no way to guess the answers to these questions, and you'll
 | ||
|         |  usually be dissatisfied with your results — so there's no way to
 | ||
|         |  know when to stop this process. In the worst case, you'll make a
 | ||
|         |  mess that invites the next reader to try their luck too. This is
 | ||
|         |  like one of those
 | ||
|         |  #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
 | ||
|         |  where the rescuers keep passing out from low oxygen, causing
 | ||
|         |  another rescuer to follow — only to succumb themselves. In short,
 | ||
|         |  just say no to optimising your Python. If it's not fast enough the
 | ||
|         |  first time, just switch to Cython.
 | ||
| 
 | ||
|     +infobox("Resources")
 | ||
|         +list.o-no-block
 | ||
|             +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
 | ||
|             +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
 | ||
|             +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
 |