Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 09:57:26 +03:00)

Commit 808f7ee417: Update API documentation
Parent: 3f4fd2c5d5
website/api/_annotation/_biluo.jade (new file, 43 lines)

@@ -0,0 +1,43 @@
//- 💫 DOCS > API > ANNOTATION > BILUO

+table([ "Tag", "Description" ])
    +row
        +cell #[code #[span.u-color-theme B] EGIN]
        +cell The first token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme I] N]
        +cell An inner token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme L] AST]
        +cell The final token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme U] NIT]
        +cell A single-token entity.

    +row
        +cell #[code #[span.u-color-theme O] UT]
        +cell A non-entity token.

+aside("Why BILUO, not IOB?")
    |  There are several coding schemes for encoding entity annotations as
    |  token tags. These coding schemes are equally expressive, but not
    |  necessarily equally learnable.
    |  #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
    |  showed that the minimal #[strong Begin], #[strong In], #[strong Out]
    |  scheme was more difficult to learn than the #[strong BILUO] scheme that
    |  we use, which explicitly marks boundary tokens.

p
    |  spaCy translates the character offsets into this scheme, in order to
    |  decide the cost of each action given the current state of the entity
    |  recogniser. The costs are then used to calculate the gradient of the
    |  loss, to train the model. The exact algorithm is a pastiche of
    |  well-known methods, and is not currently described in any single
    |  publication. The model is a greedy transition-based parser guided by a
    |  linear model whose weights are learned using the averaged perceptron
    |  loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
    |  imitation learning strategy. The transition system is equivalent to the
    |  BILUO tagging scheme.
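
p
    |  For example, the entity "New York City" in the text
    |  #[em I live in New York City.] would receive the following tags
    |  (an illustrative sketch; tag names combine the scheme prefix with
    |  the entity label):

+code.
    I     O
    live  O
    in    O
    New   B-GPE
    York  I-GPE
    City  L-GPE
    .     O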

website/api/_architecture/_cython.jade (new file, 115 lines)

@@ -0,0 +1,115 @@
//- 💫 DOCS > API > ARCHITECTURE > CYTHON

+aside("What's Cython?")
    |  #[+a("http://cython.org/") Cython] is a language for writing
    |  C extensions for Python. Most Python code is also valid Cython, but
    |  you can add type declarations to get efficient memory-managed code
    |  just like C or C++.

p
    |  spaCy's core data structures are implemented as
    |  #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
    |  managed through the #[+a(gh("cymem")) #[code cymem]]
    |  #[code cymem.Pool] class, which allows you
    |  to allocate memory which will be freed when the #[code Pool] object
    |  is garbage collected. This means you usually don't have to worry
    |  about freeing memory. You just have to decide which Python object
    |  owns the memory, and make it own the #[code Pool]. When that object
    |  goes out of scope, the memory will be freed. You do have to take
    |  care that no pointers outlive the object that owns them — but this
    |  is generally quite easy.
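
p
    |  A minimal sketch of this ownership pattern (illustrative only, using
    |  cymem's documented #[code Pool.alloc] API; not actual spaCy source):

+code.
    from cymem.cymem cimport Pool

    cdef class Counts:
        cdef Pool mem
        cdef int* counts

        def __init__(self, int n):
            # The Python object owns the Pool, and the Pool owns the C array.
            self.mem = Pool()
            self.counts = <int*>self.mem.alloc(n, sizeof(int))
            # No __dealloc__ is needed: when a Counts instance is garbage
            # collected, its Pool is freed, which frees self.counts.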

p
    |  All Cython modules should have the #[code # cython: infer_types=True]
    |  compiler directive at the top of the file. This makes the code much
    |  cleaner, as it avoids the need for many type declarations. If
    |  possible, you should prefer to declare your functions #[code nogil],
    |  even if you don't especially care about multi-threading. The reason
    |  is that #[code nogil] functions help the Cython compiler reason about
    |  your code quite a lot — you're telling the compiler that no Python
    |  dynamics are possible. This lets many errors be raised, and ensures
    |  your function will run at C speed.
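
p
    |  For instance, a function following both conventions might look like
    |  this (an illustrative sketch, not actual spaCy source):

+code.
    # cython: infer_types=True

    cdef double fahrenheit(double celsius) nogil:
        # infer_types lets the compiler type this local for us, and
        # nogil guarantees no Python objects are touched in the body.
        ratio = 9.0 / 5.0
        return celsius * ratio + 32.0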

p
    |  Cython gives you many choices of sequences: you could have a Python
    |  list, a numpy array, a memory view, a C++ vector, or a pointer.
    |  Pointers are preferred, because they are fastest, have the most
    |  explicit semantics, and let the compiler check your code more
    |  strictly. C++ vectors are also great — but you should only use them
    |  internally in functions. It's less friendly to accept a vector as an
    |  argument, because that asks the user to do much more work. Here's
    |  how to get a pointer from a numpy array, memory view or vector:

+code.
    cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
        pointer1 = <int*>numpy_array.data
        pointer2 = cpp_vector.data()
        pointer3 = &memory_view[0]

p
    |  Both C arrays and C++ vectors reassure the compiler that no Python
    |  operations are possible on your variable. This is a big advantage:
    |  it lets the Cython compiler raise many more errors for you.

p
    |  When getting a pointer from a numpy array or memoryview, take care
    |  that the data is actually stored in C-contiguous order — otherwise
    |  you'll get a pointer to nonsense. The type-declarations in the code
    |  above should generate runtime errors if buffers with incorrect
    |  memory layouts are passed in. To iterate over the array, the
    |  following style is preferred:

+code.
    cdef int c_total(const int* int_array, int length) nogil:
        total = 0
        for item in int_array[:length]:
            total += item
        return total

p
    |  If this is confusing, consider that the compiler couldn't deal with
    |  #[code for item in int_array:] — there's no length attached to a raw
    |  pointer, so how could we figure out where to stop? The length is
    |  provided in the slice notation as a solution to this. Note that we
    |  don't have to declare the type of #[code item] in the code above —
    |  the compiler can easily infer it. This gives us tidy code that looks
    |  quite like Python, but is exactly as fast as C — because we've made
    |  sure the compilation to C is trivial.
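
p
    |  (As an illustrative aside: to call such a function from Python, you
    |  would typically expose it through a small #[code def] wrapper that
    |  accepts a typed memoryview, for example:)

+code.
    def total(int[::1] values):
        # Hand the buffer's data pointer and length to the C-level function.
        return c_total(&values[0], values.shape[0])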

p
    |  Your functions cannot be declared #[code nogil] if they need to
    |  create Python objects or call Python functions. This is perfectly
    |  okay — you shouldn't torture your code just to get #[code nogil]
    |  functions. However, if your function isn't #[code nogil], you should
    |  compile your module with #[code cython -a --cplus my_module.pyx] and
    |  open the resulting #[code my_module.html] file in a browser. This
    |  will let you see how Cython is compiling your code. Calls into the
    |  Python run-time will be in bright yellow. This lets you easily see
    |  whether Cython is able to correctly type your code, or whether there
    |  are unexpected problems.

p
    |  Working in Cython is very rewarding once you're over the initial
    |  learning curve. As with C and C++, the first way you write something
    |  in Cython will often be the performance-optimal approach. In
    |  contrast, Python optimisation generally requires a lot of
    |  experimentation. Is it faster to have an #[code if item in my_dict]
    |  check, or to use #[code .get()]? What about
    |  #[code try]/#[code except]? Does this numpy operation create a copy?
    |  There's no way to guess the answers to these questions, and you'll
    |  usually be dissatisfied with your results — so there's no way to
    |  know when to stop this process. In the worst case, you'll make a
    |  mess that invites the next reader to try their luck too. This is
    |  like one of those
    |  #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
    |  where the rescuers keep passing out from low oxygen, causing
    |  another rescuer to follow — only to succumb themselves. In short,
    |  just say no to optimizing your Python. If it's not fast enough the
    |  first time, just switch to Cython.

+infobox("Resources")
    +list.o-no-block
        +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
        +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
        +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
website/api/_architecture/_nn-model.jade (new file, 141 lines)

@@ -0,0 +1,141 @@
//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE

p
    |  The parsing model is a blend of recent results. The two main
    |  inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg
    |  at Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation
    |  of the parser is still based on the work of Joakim Nivre#[+fn(2)], who
    |  introduced the transition-based framework#[+fn(3)], the arc-eager
    |  transition system, and the imitation learning objective. The model is
    |  implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
    |  library. We first predict context-sensitive vectors for each word in the
    |  input:

+code.
    (embed_lower | embed_prefix | embed_suffix | embed_shape)
        >> Maxout(token_width)
        >> convolution ** 4
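
p
    |  (In Thinc's operator notation, #[code |] concatenates layer outputs,
    |  #[code >>] chains layers, and #[code **] clones a layer. The snippet
    |  above therefore embeds four lexical features, mixes them with a
    |  maxout layer, and applies four convolutional layers.)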

p
    |  This convolutional layer is shared between the tagger, parser and NER,
    |  and will also be shared by the future neural lemmatizer. Because the
    |  parser shares these layers with the tagger, the parser does not require
    |  tag features. I got this trick from David Weiss's "Stack-propagation"
    |  paper#[+fn(4)].

p
    |  To boost the representation, the tagger actually predicts a "super tag"
    |  with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
    |  these supertags by adding a softmax layer onto the convolutional layer –
    |  so, we're teaching the convolutional layer to give us a representation
    |  that's one affine transform from this informative lexical information.
    |  This is obviously good for the parser (which backprops to the
    |  convolutions too). The parser model makes a state vector by concatenating
    |  the vector representations for its context tokens. The current context
    |  tokens are:

+table
    +row
        +cell #[code S0], #[code S1], #[code S2]
        +cell Top three words on the stack.

    +row
        +cell #[code B0], #[code B1]
        +cell First two words of the buffer.

    +row
        +cell.u-nowrap
            |  #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
            |  #[code B1L1]#[br]
            |  #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
            |  #[code B1L2]
        +cell
            |  Leftmost and second leftmost children of #[code S0], #[code S1],
            |  #[code S2], #[code B0] and #[code B1].

    +row
        +cell.u-nowrap
            |  #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
            |  #[code B1R1]#[br]
            |  #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
            |  #[code B1R2]
        +cell
            |  Rightmost and second rightmost children of #[code S0], #[code S1],
            |  #[code S2], #[code B0] and #[code B1].

p
    |  This makes the state vector quite long: #[code 13*T], where #[code T] is
    |  the token vector width (128 is working well). Fortunately, there's a way
    |  to structure the computation to save some expense (and make it more
    |  GPU-friendly).

p
    |  The parser typically visits #[code 2*N] states for a sentence of length
    |  #[code N] (although it may visit more, if it back-tracks with a
    |  non-monotonic transition#[+fn(6)]). A naive implementation would require
    |  #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
    |  size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
    |  multiplication, to pre-compute the hidden weights for each positional
    |  feature with respect to the words in the batch. (Note that our token
    |  vectors come from the CNN — so we can't play this trick over the
    |  vocabulary. That's how Stanford's NN parser#[+fn(7)] works — and why its
    |  model is so big.)
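
p
    |  A rough numpy sketch of the shapes involved (illustrative only; the
    |  names are invented for the example):

+code.
    import numpy

    B, N, T, H = 32, 20, 128, 128      # batch size, sentence length, token width, hidden width
    tokens = numpy.zeros((B * N, T))   # CNN output for every token in the batch
    W = numpy.zeros((T, 13 * H))       # hidden weights for all 13 feature slots

    # One large, GPU-friendly multiplication per batch:
    precomputed = tokens.dot(W).reshape((B * N, 13, H))

    # Each parser state's hidden vector is then just a sum of 13 rows,
    # one per positional feature; no further matrix multiplication needed.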

p
    |  This pre-computation strategy allows a nice compromise between
    |  GPU-friendliness and implementation simplicity. The CNN and the wide
    |  lower layer are computed on the GPU, and then the precomputed hidden
    |  weights are moved to the CPU, before we start the transition-based
    |  parsing process. This makes a lot of things much easier. We don't have to
    |  worry about variable-length batch sizes, and we don't have to implement
    |  the dynamic oracle in CUDA to train.

p
    |  Currently the parser's loss function is multilabel log loss#[+fn(8)], as
    |  the dynamic oracle allows multiple states to be 0 cost. This is defined
    |  as follows, where #[code gZ] is the sum of the scores assigned to gold
    |  classes:

+code.
    (exp(score) / Z) - (exp(score) / gZ)
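
p
    |  A small numpy sketch of that gradient (illustrative only; not spaCy
    |  source):

+code.
    import numpy

    def multilabel_logloss_gradient(scores, is_gold):
        # scores: (n_classes,) floats; is_gold: (n_classes,) booleans
        exps = numpy.exp(scores - scores.max())
        Z = exps.sum()              # partition over all classes
        gZ = exps[is_gold].sum()    # partition over the zero-cost (gold) classes
        gradient = exps / Z
        gradient[is_gold] -= exps[is_gold] / gZ
        return gradient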

+bibliography
    +item
        |  #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
        br
        |  Eliyahu Kiperwasser, Yoav Goldberg (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
        br
        |  Yoav Goldberg, Joakim Nivre (2012)

    +item
        |  #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
        br
        |  Matthew Honnibal (2013)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
        br
        |  Yuan Zhang, David Weiss (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
        br
        |  Anders Søgaard, Yoav Goldberg (2016)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
        br
        |  Matthew Honnibal, Mark Johnson (2015)

    +item
        |  #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
        br
        |  Danqi Chen, Christopher D. Manning (2014)

    +item
        |  #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
        br
        |  Stefan Riezler et al. (2002)

@@ -1,29 +1,32 @@
 {
     "sidebar": {
-        "Introduction": {
-            "Facts & Figures": "./",
-            "Languages": "language-models",
-            "Annotation Specs": "annotation"
+        "Overview": {
+            "Architecture": "./",
+            "Annotation Specs": "annotation",
+            "Functions": "top-level"
         },
-        "Top-level": {
-            "spacy": "spacy",
-            "displacy": "displacy",
-            "Utility Functions": "util",
-            "Command line": "cli"
-        },
-        "Classes": {
+        "Containers": {
             "Doc": "doc",
             "Token": "token",
             "Span": "span",
+            "Lexeme": "lexeme"
+        },
+
+        "Pipeline": {
             "Language": "language",
-            "Tokenizer": "tokenizer",
+            "Pipe": "pipe",
             "Tensorizer": "tensorizer",
             "Tagger": "tagger",
             "DependencyParser": "dependencyparser",
             "EntityRecognizer": "entityrecognizer",
             "TextCategorizer": "textcategorizer",
+            "Tokenizer": "tokenizer",
+            "Lemmatizer": "lemmatizer",
             "Matcher": "matcher",
-            "Lexeme": "lexeme",
+            "PhraseMatcher": "phrasematcher"
+        },
+
+        "Other": {
             "Vocab": "vocab",
             "StringStore": "stringstore",
             "Vectors": "vectors",
@@ -34,52 +37,37 @@
     },

     "index": {
-        "title": "Facts & Figures",
-        "next": "language-models"
+        "title": "Architecture",
+        "next": "annotation",
+        "menu": {
+            "Basics": "basics",
+            "Neural Network Model": "nn-model",
+            "Cython Conventions": "cython"
+        }
     },

-    "language-models": {
-        "title": "Languages",
-        "next": "philosophy"
-    },
-
-    "philosophy": {
-        "title": "Philosophy"
-    },
-
-    "spacy": {
-        "title": "spaCy top-level functions",
-        "source": "spacy/__init__.py",
-        "next": "displacy"
-    },
-
-    "displacy": {
-        "title": "displaCy",
-        "tag": "module",
-        "source": "spacy/displacy",
-        "next": "util"
-    },
-
-    "util": {
-        "title": "Utility Functions",
-        "source": "spacy/util.py",
-        "next": "cli"
-    },
-
-    "cli": {
-        "title": "Command Line Interface",
-        "source": "spacy/cli"
+    "top-level": {
+        "title": "Top-level Functions",
+        "menu": {
+            "spacy": "spacy",
+            "displacy": "displacy",
+            "Utility Functions": "util",
+            "Compatibility": "compat",
+            "Command Line": "cli"
+        }
     },

     "language": {
         "title": "Language",
         "tag": "class",
+        "teaser": "A text-processing pipeline.",
         "source": "spacy/language.py"
     },

     "doc": {
         "title": "Doc",
         "tag": "class",
+        "teaser": "A container for accessing linguistic annotations.",
         "source": "spacy/tokens/doc.pyx"
     },

@@ -103,6 +91,7 @@

     "vocab": {
         "title": "Vocab",
+        "teaser": "A storage class for vocabulary and other data shared across a language.",
         "tag": "class",
         "source": "spacy/vocab.pyx"
     },
@@ -115,10 +104,27 @@

     "matcher": {
         "title": "Matcher",
+        "teaser": "Match sequences of tokens, based on pattern rules.",
         "tag": "class",
         "source": "spacy/matcher.pyx"
     },

+    "phrasematcher": {
+        "title": "PhraseMatcher",
+        "teaser": "Match sequences of tokens, based on documents.",
+        "tag": "class",
+        "tag_new": 2,
+        "source": "spacy/matcher.pyx"
+    },
+
+    "pipe": {
+        "title": "Pipe",
+        "teaser": "Abstract base class defining the API for pipeline components.",
+        "tag": "class",
+        "tag_new": 2,
+        "source": "spacy/pipeline.pyx"
+    },
+
     "dependenyparser": {
         "title": "DependencyParser",
         "tag": "class",
@@ -127,18 +133,22 @@

     "entityrecognizer": {
         "title": "EntityRecognizer",
+        "teaser": "Annotate named entities on documents.",
         "tag": "class",
         "source": "spacy/pipeline.pyx"
     },

     "textcategorizer": {
         "title": "TextCategorizer",
+        "teaser": "Add text categorization models to spaCy pipelines.",
         "tag": "class",
+        "tag_new": 2,
         "source": "spacy/pipeline.pyx"
     },

     "dependencyparser": {
         "title": "DependencyParser",
+        "teaser": "Annotate syntactic dependencies on documents.",
         "tag": "class",
         "source": "spacy/pipeline.pyx"
     },
@@ -149,15 +159,23 @@
         "source": "spacy/tokenizer.pyx"
     },

+    "lemmatizer": {
+        "title": "Lemmatizer",
+        "tag": "class"
+    },
+
     "tagger": {
         "title": "Tagger",
+        "teaser": "Annotate part-of-speech tags on documents.",
         "tag": "class",
         "source": "spacy/pipeline.pyx"
     },

     "tensorizer": {
         "title": "Tensorizer",
+        "teaser": "Add a tensor with position-sensitive meaning representations to a document.",
         "tag": "class",
+        "tag_new": 2,
         "source": "spacy/pipeline.pyx"
     },

@@ -169,23 +187,38 @@

     "goldcorpus": {
         "title": "GoldCorpus",
+        "teaser": "An annotated corpus, using the JSON file format.",
         "tag": "class",
+        "tag_new": 2,
         "source": "spacy/gold.pyx"
     },

     "binder": {
         "title": "Binder",
         "tag": "class",
+        "tag_new": 2,
         "source": "spacy/tokens/binder.pyx"
     },

     "vectors": {
         "title": "Vectors",
+        "teaser": "Store, save and load word vectors.",
         "tag": "class",
+        "tag_new": 2,
         "source": "spacy/vectors.pyx"
     },

     "annotation": {
-        "title": "Annotation Specifications"
+        "title": "Annotation Specifications",
+        "teaser": "Schemes used for labels, tags and training data.",
+        "menu": {
+            "Tokenization": "tokenization",
+            "Sentence Boundaries": "sbd",
+            "POS Tagging": "pos-tagging",
+            "Lemmatization": "lemmatization",
+            "Dependencies": "dependency-parsing",
+            "Named Entities": "named-entities",
+            "Training Data": "training"
+        }
     }
 }
| 
						 | 
					@ -1,26 +1,17 @@
 | 
				
			||||||
//- 💫 DOCS > USAGE > COMMAND LINE INTERFACE
 | 
					//- 💫 DOCS > API > TOP-LEVEL > COMMAND LINE INTERFACE
 | 
				
			||||||
 | 
					 | 
				
			||||||
include ../../_includes/_mixins
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  As of v1.7.0, spaCy comes with new command line helpers to download and
 | 
					    |  As of v1.7.0, spaCy comes with new command line helpers to download and
 | 
				
			||||||
    |  link models and show useful debugging information. For a list of available
 | 
					    |  link models and show useful debugging information. For a list of available
 | 
				
			||||||
    |  commands, type #[code spacy --help].
 | 
					    |  commands, type #[code spacy --help].
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+infobox("⚠️ Deprecation note")
 | 
					+h(3, "download") Download
 | 
				
			||||||
    |  As of spaCy 2.0, the #[code model] command to initialise a model data
 | 
					 | 
				
			||||||
    |  directory is deprecated. The command was only necessary because previous
 | 
					 | 
				
			||||||
    |  versions of spaCy expected a model directory to already be set up. This
 | 
					 | 
				
			||||||
    |  has since been changed, so you can use the #[+api("cli#train") #[code train]]
 | 
					 | 
				
			||||||
    |  command straight away.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+h(2, "download") Download
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  Download #[+a("/docs/usage/models") models] for spaCy. The downloader finds the
 | 
					    |  Download #[+a("/usage/models") models] for spaCy. The downloader finds the
 | 
				
			||||||
    |  best-matching compatible version, uses pip to download the model as a
 | 
					    |  best-matching compatible version, uses pip to download the model as a
 | 
				
			||||||
    |  package and automatically creates a
 | 
					    |  package and automatically creates a
 | 
				
			||||||
    |  #[+a("/docs/usage/models#usage") shortcut link] to load the model by name.
 | 
					    |  #[+a("/usage/models#usage") shortcut link] to load the model by name.
 | 
				
			||||||
    |  Direct downloads don't perform any compatibility checks and require the
 | 
					    |  Direct downloads don't perform any compatibility checks and require the
 | 
				
			||||||
    |  model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).
 | 
					    |  model name to be specified with its version (e.g., #[code en_core_web_sm-1.2.0]).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| 
						 | 
					@ -49,15 +40,15 @@ p
 | 
				
			||||||
    |  detailed messages in case things go wrong. It's #[strong not recommended]
 | 
					    |  detailed messages in case things go wrong. It's #[strong not recommended]
 | 
				
			||||||
    |  to use this command as part of an automated process. If you know which
 | 
					    |  to use this command as part of an automated process. If you know which
 | 
				
			||||||
    |  model your project needs, you should consider a
 | 
					    |  model your project needs, you should consider a
 | 
				
			||||||
    |  #[+a("/docs/usage/models#download-pip") direct download via pip], or
 | 
					    |  #[+a("/usage/models#download-pip") direct download via pip], or
 | 
				
			||||||
    |  uploading the model to a local PyPi installation and fetching it straight
 | 
					    |  uploading the model to a local PyPi installation and fetching it straight
 | 
				
			||||||
    |  from there. This will also allow you to add it as a versioned package
 | 
					    |  from there. This will also allow you to add it as a versioned package
 | 
				
			||||||
    |  dependency to your project.
 | 
					    |  dependency to your project.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+h(2, "link") Link
 | 
					+h(3, "link") Link
 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  Create a #[+a("/docs/usage/models#usage") shortcut link] for a model,
 | 
					    |  Create a #[+a("/usage/models#usage") shortcut link] for a model,
 | 
				
			||||||
    |  either a Python package or a local directory. This will let you load
 | 
					    |  either a Python package or a local directory. This will let you load
 | 
				
			||||||
    |  models from any location using a custom name via
 | 
					    |  models from any location using a custom name via
 | 
				
			||||||
    |  #[+api("spacy#load") #[code spacy.load()]].
 | 
					    |  #[+api("spacy#load") #[code spacy.load()]].
 | 
				
			||||||
| 
						 | 
					@ -95,7 +86,7 @@ p
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
        +cell Show help message and available arguments.
 | 
					        +cell Show help message and available arguments.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+h(2, "info") Info
 | 
					+h(3, "info") Info
 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  Print information about your spaCy installation, models and local setup,
 | 
					    |  Print information about your spaCy installation, models and local setup,
 | 
				
			||||||
| 
						 | 
					@ -122,15 +113,15 @@ p
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
        +cell Show help message and available arguments.
 | 
					        +cell Show help message and available arguments.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+h(2, "convert") Convert
 | 
					+h(3, "convert") Convert
 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  Convert files into spaCy's #[+a("/docs/api/annotation#json-input") JSON format]
 | 
					    |  Convert files into spaCy's #[+a("/api/annotation#json-input") JSON format]
 | 
				
			||||||
    |  for use with the #[code train] command and other experiment management
 | 
					    |  for use with the #[code train] command and other experiment management
 | 
				
			||||||
    |  functions. The right converter is chosen based on the file extension of
 | 
					    |  functions. The right converter is chosen based on the file extension of
 | 
				
			||||||
    |  the input file. Currently only supports #[code .conllu].
 | 
					    |  the input file. Currently only supports #[code .conllu].
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+code(false, "bash", "$").
 | 
					+code(false, "bash", "$", false, false, true).
 | 
				
			||||||
    spacy convert [input_file] [output_dir] [--n-sents] [--morphology]
 | 
					    spacy convert [input_file] [output_dir] [--n-sents] [--morphology]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+table(["Argument", "Type", "Description"])
 | 
					+table(["Argument", "Type", "Description"])
 | 
				
			||||||
| 
						 | 
					@ -159,14 +150,18 @@ p
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
        +cell Show help message and available arguments.
 | 
					        +cell Show help message and available arguments.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+h(2, "train") Train
 | 
					+h(3, "train") Train
 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  Train a model. Expects data in spaCy's
 | 
					    |  Train a model. Expects data in spaCy's
 | 
				
			||||||
    |  #[+a("/docs/api/annotation#json-input") JSON format].
 | 
					    |  #[+a("/api/annotation#json-input") JSON format]. On each epoch, a model
 | 
				
			||||||
 | 
					    |  will be saved out to the directory. Accuracy scores and model details
 | 
				
			||||||
 | 
					    |  will be added to a #[+a("/usage/training#models-generating") #[code meta.json]]
 | 
				
			||||||
 | 
					    |  to allow packaging the model using the
 | 
				
			||||||
 | 
					    |  #[+api("cli#package") #[code package]] command.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+code(false, "bash", "$").
 | 
					+code(false, "bash", "$", false, false, true).
 | 
				
			||||||
    spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--no-tagger] [--no-parser] [--no-entities]
 | 
					    spacy train [lang] [output_dir] [train_data] [dev_data] [--n-iter] [--n-sents] [--use-gpu] [--meta-path] [--vectors] [--no-tagger] [--no-parser] [--no-entities] [--gold-preproc]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+table(["Argument", "Type", "Description"])
 | 
					+table(["Argument", "Type", "Description"])
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
| 
						 | 
					@ -204,6 +199,27 @@ p
 | 
				
			||||||
        +cell option
 | 
					        +cell option
 | 
				
			||||||
        +cell Use GPU.
 | 
					        +cell Use GPU.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code --vectors], #[code -v]
 | 
				
			||||||
 | 
					        +cell option
 | 
				
			||||||
 | 
					        +cell Model to load vectors from.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code --meta-path], #[code -m]
 | 
				
			||||||
 | 
					        +cell option
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  #[+tag-new(2)] Optional path to model
 | 
				
			||||||
 | 
					            |  #[+a("/usage/training#models-generating") #[code meta.json]].
 | 
				
			||||||
 | 
					            |  All relevant properties like #[code lang], #[code pipeline] and
 | 
				
			||||||
 | 
					            |  #[code spacy_version] will be overwritten.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code --version], #[code -V]
 | 
				
			||||||
 | 
					        +cell option
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  Model version. Will be written out to the model's
 | 
				
			||||||
 | 
					            |  #[code meta.json] after training.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code --no-tagger], #[code -T]
 | 
					        +cell #[code --no-tagger], #[code -T]
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
| 
						 | 
					@ -219,12 +235,18 @@ p
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
        +cell Don't train NER.
 | 
					        +cell Don't train NER.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code --gold-preproc], #[code -G]
 | 
				
			||||||
 | 
					        +cell flag
 | 
				
			||||||
 | 
					        +cell Use gold preprocessing.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code --help], #[code -h]
 | 
					        +cell #[code --help], #[code -h]
 | 
				
			||||||
        +cell flag
 | 
					        +cell flag
 | 
				
			||||||
        +cell Show help message and available arguments.
 | 
					        +cell Show help message and available arguments.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+h(3, "train-hyperparams") Environment variables for hyperparameters
 | 
					+h(4, "train-hyperparams") Environment variables for hyperparameters
 | 
				
			||||||
 | 
					    +tag-new(2)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
p
 | 
					p
 | 
				
			||||||
    |  spaCy lets you set hyperparameters for training via environment variables.
 | 
					    |  spaCy lets you set hyperparameters for training via environment variables.
 | 
				
			||||||
| 
						 | 
					@ -236,98 +258,96 @@ p
 | 
				
			||||||
+code(false, "bash").
 | 
					+code(false, "bash").
 | 
				
			||||||
    parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
 | 
					    parser_hidden_depth=2 parser_maxout_pieces=1 train-parser
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+under-construction
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+table(["Name", "Description", "Default"])
 | 
					+table(["Name", "Description", "Default"])
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code dropout_from]
 | 
					        +cell #[code dropout_from]
 | 
				
			||||||
        +cell
 | 
					        +cell Initial dropout rate.
 | 
				
			||||||
        +cell #[code 0.2]
 | 
					        +cell #[code 0.2]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code dropout_to]
 | 
					        +cell #[code dropout_to]
 | 
				
			||||||
        +cell
 | 
					        +cell Final dropout rate.
 | 
				
			||||||
        +cell #[code 0.2]
 | 
					        +cell #[code 0.2]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code dropout_decay]
 | 
					        +cell #[code dropout_decay]
 | 
				
			||||||
        +cell
 | 
					        +cell Rate of dropout change.
 | 
				
			||||||
        +cell #[code 0.0]
 | 
					        +cell #[code 0.0]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code batch_from]
 | 
					        +cell #[code batch_from]
 | 
				
			||||||
        +cell
 | 
					        +cell Initial batch size.
 | 
				
			||||||
        +cell #[code 1]
 | 
					        +cell #[code 1]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code batch_to]
 | 
					        +cell #[code batch_to]
 | 
				
			||||||
        +cell
 | 
					        +cell Final batch size.
 | 
				
			||||||
        +cell #[code 64]
 | 
					        +cell #[code 64]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code batch_compound]
 | 
					        +cell #[code batch_compound]
 | 
				
			||||||
        +cell
 | 
					        +cell Rate of batch size acceleration.
 | 
				
			||||||
        +cell #[code 1.001]
 | 
					        +cell #[code 1.001]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code token_vector_width]
 | 
					        +cell #[code token_vector_width]
 | 
				
			||||||
        +cell
 | 
					        +cell Width of embedding tables and convolutional layers.
 | 
				
			||||||
        +cell #[code 128]
 | 
					        +cell #[code 128]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code embed_size]
 | 
					        +cell #[code embed_size]
 | 
				
			||||||
        +cell
 | 
					        +cell Number of rows in embedding tables.
 | 
				
			||||||
        +cell #[code 7500]
 | 
					        +cell #[code 7500]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code parser_maxout_pieces]
 | 
					        +cell #[code parser_maxout_pieces]
 | 
				
			||||||
        +cell
 | 
					        +cell Number of pieces in the parser's and NER's first maxout layer.
 | 
				
			||||||
        +cell #[code 2]
 | 
					        +cell #[code 2]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code parser_hidden_depth]
 | 
					        +cell #[code parser_hidden_depth]
 | 
				
			||||||
        +cell
 | 
					        +cell Number of hidden layers in the parser and NER.
 | 
				
			||||||
        +cell #[code 1]
 | 
					        +cell #[code 1]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code hidden_width]
 | 
					        +cell #[code hidden_width]
 | 
				
			||||||
        +cell
 | 
					        +cell Size of the parser's and NER's hidden layers.
 | 
				
			||||||
        +cell #[code 128]
 | 
					        +cell #[code 128]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code learn_rate]
 | 
					        +cell #[code learn_rate]
 | 
				
			||||||
        +cell
 | 
					        +cell Learning rate.
 | 
				
			||||||
        +cell #[code 0.001]
 | 
					        +cell #[code 0.001]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code optimizer_B1]
 | 
					        +cell #[code optimizer_B1]
 | 
				
			||||||
        +cell
 | 
					        +cell Momentum for the Adam solver.
 | 
				
			||||||
        +cell #[code 0.9]
 | 
					        +cell #[code 0.9]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code optimizer_B2]
 | 
					        +cell #[code optimizer_B2]
 | 
				
			||||||
        +cell
 | 
					        +cell Adagrad-momentum for the Adam solver.
 | 
				
			||||||
        +cell #[code 0.999]
 | 
					        +cell #[code 0.999]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
 | 
					    +row
 | 
				
			||||||
        +cell #[code optimizer_eps]
 | 
					        +cell #[code optimizer_eps]
 | 
				
			||||||
        +cell
 | 
					        +cell Epsylon value for the Adam solver.
 | 
				
			||||||
        +cell #[code 1e-08]
 | 
					        +cell #[code 1e-08]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +row
        +cell #[code L2_penalty]
        +cell L2 regularisation penalty.
        +cell #[code 1e-06]

    +row
        +cell #[code grad_norm_clip]
        +cell Gradient L2 norm constraint.
        +cell #[code 1.0]

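p
    |  As a rough sketch of how these defaults can be overridden: spaCy reads
    |  its hyperparameters through #[code util.env_opt], which checks an
    |  environment variable before falling back to the default. The assumption
    |  that each variable is named after the hyperparameter in this table is
    |  ours, not a documented, stable interface.

+aside-code("Example").
    import os
    from spacy import util

    # assumed naming: environment variable matches the hyperparameter name
    os.environ['learn_rate'] = '0.0001'
    learn_rate = util.env_opt('learn_rate', 0.001)
    assert learn_rate == 0.0001
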
+h(3, "package") Package

p
    |  Generate a #[+a("/usage/training#models-generating") model Python package]
    |  from an existing model data directory. All data files are copied over.
    |  If the path to a meta.json is supplied, or a meta.json is found in the
    |  input directory, this file is used. Otherwise, the data can be entered

@ -336,8 +356,8 @@ p
    |  sure you're always using the latest versions. This means you need to be
    |  connected to the internet to use this command.

+code(false, "bash", "$", false, false, true).
    spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force]

+table(["Argument", "Type", "Description"])
    +row

@ -353,14 +373,14 @@ p
    +row
        +cell #[code --meta-path], #[code -m]
        +cell option
        +cell #[+tag-new(2)] Path to meta.json file (optional).

    +row
        +cell #[code --create-meta], #[code -c]
        +cell flag
        +cell
            |  #[+tag-new(2)] Create a meta.json file on the command line, even
            |  if one already exists in the directory.

    +row
        +cell #[code --force], #[code -f]

91 website/api/_top-level/_compat.jade Normal file

@ -0,0 +1,91 @@
//- 💫 DOCS > API > TOP-LEVEL > COMPATIBILITY

p
    |  All Python code is written in an
    |  #[strong intersection of Python 2 and Python 3]. This is easy in Cython,
    |  but somewhat ugly in Python. Logic that deals with Python or platform
    |  compatibility only lives in #[code spacy.compat]. To distinguish them from
    |  the builtin functions, replacement functions are suffixed with an
    |  underscore, e.g. #[code unicode_]. For specific checks, spaCy uses the
    |  #[code six] and #[code ftfy] packages.

+aside-code("Example").
    from spacy.compat import unicode_, json_dumps

    compatible_unicode = unicode_('hello world')
    compatible_json = json_dumps({'key': 'value'})

+table(["Name", "Python 2", "Python 3"])
    +row
        +cell #[code compat.bytes_]
        +cell #[code str]
        +cell #[code bytes]

    +row
        +cell #[code compat.unicode_]
        +cell #[code unicode]
        +cell #[code str]

    +row
        +cell #[code compat.basestring_]
        +cell #[code basestring]
        +cell #[code str]

    +row
        +cell #[code compat.input_]
        +cell #[code raw_input]
        +cell #[code input]

    +row
        +cell #[code compat.json_dumps]
        +cell #[code ujson.dumps] with #[code .decode('utf8')]
        +cell #[code ujson.dumps]

    +row
        +cell #[code compat.path2str]
        +cell #[code str(path)] with #[code .decode('utf8')]
        +cell #[code str(path)]

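p
    |  A small usage sketch for #[code path2str], assuming only the behaviour
    |  listed in the table above.

+aside-code("Example").
    from pathlib import Path
    from spacy.compat import path2str

    # returns a unicode string on both Python 2 and Python 3
    assert path2str(Path('/tmp')) == '/tmp'
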
+h(3, "is_config") compat.is_config
    +tag function

p
    |  Check if a specific configuration of Python version and operating system
    |  matches the user's setup. Mostly used to display targeted error messages.

+aside-code("Example").
    from spacy.compat import is_config

    if is_config(python2=True, windows=True):
        print("You are using Python 2 on Windows.")

+table(["Name", "Type", "Description"])
    +row
        +cell #[code python2]
        +cell bool
        +cell spaCy is executed with Python 2.x.

    +row
        +cell #[code python3]
        +cell bool
        +cell spaCy is executed with Python 3.x.

    +row
        +cell #[code windows]
        +cell bool
        +cell spaCy is executed on Windows.

    +row
        +cell #[code linux]
        +cell bool
        +cell spaCy is executed on Linux.

    +row
        +cell #[code osx]
        +cell bool
        +cell spaCy is executed on OS X or macOS.

    +row("foot")
        +cell returns
        +cell bool
        +cell Whether the specified configuration matches the user's platform.

@ -1,14 +1,12 @@
//- 💫 DOCS > API > TOP-LEVEL > DISPLACY

p
    |  As of v2.0, spaCy comes with a built-in visualization suite. For more
    |  info and examples, see the usage guide on
    |  #[+a("/usage/visualizers") visualizing spaCy].


+h(3, "displacy.serve") displacy.serve
    +tag method
    +tag-new(2)

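p
    |  A minimal sketch of serving a dependency visualization. It assumes an
    |  English model is installed under the shortcut #[code 'en'].

+aside-code("Example").
    import spacy
    from spacy import displacy

    nlp = spacy.load('en')
    doc = nlp(u'This is a sentence.')
    displacy.serve(doc, style='dep')
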
@ -60,7 +58,7 @@ p
        +cell bool
        +cell
            |  Don't parse #[code Doc] and instead expect a dict or list of
            |  dicts. #[+a("/usage/visualizers#manual-usage") See here]
            |  for formats and examples.
        +cell #[code False]

@ -70,7 +68,7 @@ p
        +cell Port to serve visualization.
        +cell #[code 5000]

+h(3, "displacy.render") displacy.render
    +tag method
    +tag-new(2)

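p
    |  A short sketch of rendering markup directly, under the same assumptions
    |  as the #[code displacy.serve] example above.

+aside-code("Example").
    html = displacy.render(doc, style='dep', page=True)
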
@ -127,24 +125,24 @@ p Render a dependency parse tree or named entity visualization.
        +cell bool
        +cell
            |  Don't parse #[code Doc] and instead expect a dict or list of
            |  dicts. #[+a("/usage/visualizers#manual-usage") See here]
            |  for formats and examples.
        +cell #[code False]

    +row("foot")
        +cell returns
        +cell unicode
        +cell Rendered HTML markup.
        +cell

+h(3, "displacy_options") Visualizer options

p
    |  The #[code options] argument lets you specify additional settings for
    |  each visualizer. If a setting is not present in the options, the default
    |  value will be used.

+h(4, "options-dep") Dependency Visualizer options

+aside-code("Example").
    options = {'compact': True, 'color': 'blue'}

@ -219,7 +217,7 @@ p
        +cell Distance between words in px.
        +cell #[code 175] / #[code 85] (compact)

+h(4, "displacy_options-ent") Named Entity Visualizer options

+aside-code("Example").
    options = {'ents': ['PERSON', 'ORG', 'PRODUCT'],

@ -244,6 +242,6 @@ p

p
    |  By default, displaCy comes with colours for all
    |  #[+a("/api/annotation#named-entities") entity types supported by spaCy].
    |  If you're using custom entity types, you can use the #[code colors]
    |  setting to add your own colours for them.

@ -1,15 +1,13 @@
//- 💫 DOCS > API > TOP-LEVEL > SPACY

+h(3, "spacy.load") spacy.load
    +tag function
    +tag-model

p
    |  Load a model via its #[+a("/usage/models#usage") shortcut link],
    |  the name of an installed
    |  #[+a("/usage/training#models-generating") model package], a unicode
    |  path or a #[code Path]-like object. spaCy will try resolving the load
    |  argument in this order. If a model is loaded from a shortcut link or
    |  package name, spaCy will assume it's a Python package and import it and

@ -38,25 +36,57 @@ p
        +cell list
        +cell
            |  Names of pipeline components to
            |  #[+a("/usage/processing-pipelines#disabling") disable].

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell A #[code Language] object with the loaded model.

+infobox("Deprecation note", "⚠️")
    .o-block
        |  As of spaCy 2.0, the #[code path] keyword argument is deprecated. spaCy
        |  will also raise an error if no model could be loaded and never just
        |  return an empty #[code Language] object. If you need a blank language,
        |  you can use the new function #[+api("spacy#blank") #[code spacy.blank()]]
        |  or import the class explicitly, e.g.
        |  #[code from spacy.lang.en import English].

    +code-new nlp = spacy.load('/model')
    +code-old nlp = spacy.load('en', path='/model')

+h(3, "spacy.blank") spacy.blank
    +tag function
    +tag-new(2)

p
    |  Create a blank model of a given language class. This function is the
    |  twin of #[code spacy.load()].

+aside-code("Example").
    nlp_en = spacy.blank('en')
    nlp_de = spacy.blank('de')

+table(["Name", "Type", "Description"])
    +row
        +cell #[code name]
        +cell unicode
        +cell ISO code of the language class to load.

    +row
        +cell #[code disable]
        +cell list
        +cell
            |  Names of pipeline components to
            |  #[+a("/usage/processing-pipelines#disabling") disable].

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell An empty #[code Language] object of the appropriate subclass.


+h(3, "spacy.info") spacy.info
    +tag function

p

@ -83,13 +113,13 @@ p
        +cell Print information as Markdown.

+h(3, "spacy.explain") spacy.explain
    +tag function

p
    |  Get a description for a given POS tag, dependency label or entity type.
    |  For a list of available terms, see
    |  #[+src(gh("spacy", "spacy/glossary.py")) #[code glossary.py]].

+aside-code("Example").
    spacy.explain('NORP')

@ -107,18 +137,18 @@ p
        +cell unicode
        +cell Term to explain.

    +row("foot")
        +cell returns
        +cell unicode
        +cell The explanation, or #[code None] if not found in the glossary.

+h(3, "spacy.set_factory") spacy.set_factory
    +tag function
    +tag-new(2)

p
    |  Set a factory that returns a custom
    |  #[+a("/usage/processing-pipelines") processing pipeline]
    |  component. Factories are useful for creating stateful components,
    |  especially ones which depend on shared data.

+aside-code("Example").

@ -1,10 +1,8 @@
//- 💫 DOCS > API > TOP-LEVEL > UTIL

p
    |  spaCy comes with a small collection of utility functions located in
    |  #[+src(gh("spaCy", "spacy/util.py")) #[code spacy/util.py]].
    |  Because utility functions are mostly intended for
    |  #[strong internal use within spaCy], their behaviour may change with
    |  future releases. The functions documented on this page should be safe

@ -12,7 +10,7 @@ p
    |  recommend having additional tests in place if your application depends on
    |  any of spaCy's utilities.

+h(3, "util.get_data_path") util.get_data_path
    +tag function

p

@ -25,12 +23,12 @@ p
        +cell bool
        +cell Only return path if it exists, otherwise return #[code None].

    +row("foot")
        +cell returns
        +cell #[code Path] / #[code None]
        +cell Data path or #[code None].

+h(3, "util.set_data_path") util.set_data_path
    +tag function

p

@ -47,12 +45,12 @@ p
        +cell unicode or #[code Path]
        +cell Path to new data directory.

+h(3, "util.get_lang_class") util.get_lang_class
    +tag function

p
    |  Import and load a #[code Language] class. Allows lazy-loading
    |  #[+a("/usage/adding-languages") language data] and importing
    |  languages using the two-letter language code.

+aside-code("Example").

@ -67,12 +65,12 @@ p
        +cell unicode
        +cell Two-letter language code, e.g. #[code 'en'].

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell Language class.

+h(3, "util.load_model") util.load_model
    +tag function
    +tag-new(2)

@ -101,12 +99,12 @@ p
        +cell -
        +cell Specific overrides, like pipeline components to disable.

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell #[code Language] class with the loaded model.

+h(3, "util.load_model_from_path") util.load_model_from_path
    +tag function
    +tag-new(2)

@ -139,18 +137,18 @@ p
        +cell -
        +cell Specific overrides, like pipeline components to disable.

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell #[code Language] class with the loaded model.

+h(3, "util.load_model_from_init_py") util.load_model_from_init_py
    +tag function
    +tag-new(2)

p
    |  A helper function to use in the #[code load()] method of a model package's
    |  #[+src(gh("spacy-dev-resources", "templates/model/en_model_name/__init__.py")) #[code __init__.py]].

+aside-code("Example").
    from spacy.util import load_model_from_init_py

@ -169,12 +167,12 @@ p
        +cell -
        +cell Specific overrides, like pipeline components to disable.

    +row("foot")
        +cell returns
        +cell #[code Language]
        +cell #[code Language] class with the loaded model.

+h(3, "util.get_model_meta") util.get_model_meta
    +tag function
    +tag-new(2)

@ -190,17 +188,17 @@ p
        +cell unicode or #[code Path]
        +cell Path to model directory.

    +row("foot")
        +cell returns
        +cell dict
        +cell The model's meta data.

+h(3, "util.is_package") util.is_package
    +tag function

p
    |  Check if a string maps to a package installed via pip. Mainly used to
    |  validate #[+a("/usage/models") model packages].

+aside-code("Example").
    util.is_package('en_core_web_sm') # True

@ -212,18 +210,18 @@ p
        +cell unicode
        +cell Name of package.

    +row("foot")
        +cell returns
        +cell #[code bool]
        +cell #[code True] if installed package, #[code False] if not.

+h(3, "util.get_package_path") util.get_package_path
    +tag function
    +tag-new(2)

p
    |  Get path to an installed package. Mainly used to resolve the location of
    |  #[+a("/usage/models") model packages]. Currently imports the package
    |  to find its path.

+aside-code("Example").

@ -236,12 +234,12 @@ p
        +cell unicode
        +cell Name of installed package.

    +row("foot")
        +cell returns
        +cell #[code Path]
        +cell Path to model package directory.

+h(3, "util.is_in_jupyter") util.is_in_jupyter
    +tag function
    +tag-new(2)

@ -257,17 +255,17 @@ p
        return display(HTML(html))

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell returns
        +cell bool
        +cell #[code True] if in Jupyter, #[code False] if not.

+h(3, "util.update_exc") util.update_exc
    +tag function

p
    |  Update, validate and overwrite
    |  #[+a("/usage/adding-languages#tokenizer-exceptions") tokenizer exceptions].
    |  Used to combine global exceptions with custom, language-specific
    |  exceptions. Will raise an error if the key doesn't match #[code ORTH] values.

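p
    |  A sketch of combining exception dicts. The token text and attributes
    |  here are invented for illustration.

+aside-code("Example").
    from spacy.attrs import ORTH, LEMMA
    from spacy import util

    BASE = {"a.": [{ORTH: "a."}]}
    CUSTOM = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
    # the custom entry overwrites the base entry for "a."
    exc = util.update_exc(BASE, CUSTOM)
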
@ -288,20 +286,20 @@ p
        +cell dicts
        +cell Exception dictionaries to add to the base exceptions, in order.

    +row("foot")
        +cell returns
        +cell dict
        +cell Combined tokenizer exceptions.


+h(3, "util.prints") util.prints
    +tag function
    +tag-new(2)

p
    |  Print a formatted, text-wrapped message with optional title. If a text
    |  argument is a #[code Path], it's converted to a string. Should only
    |  be used for interactive components like the command-line interface.

+aside-code("Example").
    data_path = Path('/some/path')

131 website/api/annotation.jade Normal file

@ -0,0 +1,131 @@
//- 💫 DOCS > API > ANNOTATION SPECS

include ../_includes/_mixins

p This document describes the target annotations spaCy is trained to predict.


+section("tokenization")
    +h(2, "tokenization") Tokenization

    p
        |  Tokenization standards are based on the
        |  #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
        |  The tokenizer differs from most by including tokens for significant
        |  whitespace. Any sequence of whitespace characters beyond a single space
        |  (#[code ' ']) is included as a token.

    +aside-code("Example").
        from spacy.lang.en import English
        nlp = English()
        tokens = nlp('Some\nspaces  and\ttab characters')
        tokens_text = [t.text for t in tokens]
        assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
                               '\t', 'tab', 'characters']

    p
        |  The whitespace tokens are useful for much the same reason punctuation
        |  is – it's often an important delimiter in the text. By preserving it
        |  in the token output, we are able to maintain a simple alignment
        |  between the tokens and the original string, and we ensure that no
        |  information is lost during processing.

+section("sbd")
    +h(2, "sentence-boundary") Sentence boundary detection

    p
        |  Sentence boundaries are calculated from the syntactic parse tree, so
        |  features such as punctuation and capitalisation play an important but
        |  non-decisive role in determining the sentence boundaries. Usually this
        |  means that the sentence boundaries will at least coincide with clause
        |  boundaries, even given poorly punctuated text.
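
    p
        |  As a quick illustration, assuming a model with a parser is loaded
        |  as #[code nlp], the computed boundaries are exposed via
        |  #[code Doc.sents].

    +aside-code("Example").
        doc = nlp(u'This is a sentence. This is another one.')
        sents = list(doc.sents)
        assert len(sents) == 2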

+section("pos-tagging")
    +h(2, "pos-tagging") Part-of-speech Tagging

    +aside("Tip: Understanding tags")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of a tag. For example,
        |  #[code spacy.explain("RB")] will return "adverb".

    include _annotation/_pos-tags

+section("lemmatization")
    +h(2, "lemmatization") Lemmatization

    p A "lemma" is the uninflected form of a word. In English, this means:

    +list
        +item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
        +item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
        +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
        +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"

    p
        |  The lemmatization data is taken from
        |  #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
        |  special case for pronouns: all pronouns are lemmatized to the special
        |  token #[code -PRON-].
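
    p
        |  A small sketch of what this looks like in practice, assuming an
        |  English model with a tagger is loaded as #[code nlp] – the exact
        |  lemmas depend on the model.

    +aside-code("Example").
        doc = nlp(u'I was reading the paper.')
        assert doc[0].lemma_ == '-PRON-'  # "I"
        assert doc[2].lemma_ == 'read'    # "reading"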

    +infobox("About spaCy's custom pronoun lemma")
        |  Unlike verbs and common nouns, there's no clear base form of a personal
        |  pronoun. Should the lemma of "me" be "I", or should we normalize person
        |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
        |  novel symbol, #[code -PRON-], which is used as the lemma for
        |  all personal pronouns.

+section("dependency-parsing")
    +h(2, "dependency-parsing") Syntactic Dependency Parsing

    +aside("Tip: Understanding labels")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of a label. For example,
        |  #[code spacy.explain("prt")] will return "particle".

    include _annotation/_dep-labels

+section("named-entities")
    +h(2, "named-entities") Named Entity Recognition

    +aside("Tip: Understanding entity types")
        |  You can also use #[code spacy.explain()] to get the description for the
        |  string representation of an entity label. For example,
        |  #[code spacy.explain("LANGUAGE")] will return "any named language".

    include _annotation/_named-entities

    +h(3, "biluo") BILUO Scheme

    include _annotation/_biluo
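
    p
        |  A short sketch of converting character offsets into BILUO tags with
        |  the #[code biluo_tags_from_offsets] helper in #[code spacy.gold].
        |  The sentence is made up for illustration and assumes a model is
        |  loaded as #[code nlp].

    +aside-code("Example").
        from spacy.gold import biluo_tags_from_offsets

        doc = nlp(u'I like London.')
        tags = biluo_tags_from_offsets(doc, [(7, 13, 'LOC')])
        assert tags == ['O', 'O', 'U-LOC', 'O']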

+section("training")
    +h(2, "json-input") JSON input format for training

    +under-construction

    p spaCy takes training data in the following format:

    +code("Example structure").
        doc: {
            id: string,
            paragraphs: [{
                raw: string,
                sents: [int],
                tokens: [{
                    start: int,
                    tag: string,
                    head: int,
                    dep: string
                }],
                ner: [{
                    start: int,
                    end: int,
                    label: string
                }],
                brackets: [{
                    start: int,
                    end: int,
                    label: string
                }]
            }]
        }
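
    p
        |  A hand-written instance of this structure, purely illustrative
        |  since the format above is still marked as under construction. All
        |  values, including the relative #[code head] offsets, are invented.

    +code("Example instance").
        {
            "id": "doc1",
            "paragraphs": [{
                "raw": "I like London.",
                "sents": [0],
                "tokens": [
                    {"start": 0, "tag": "PRP", "head": 1, "dep": "nsubj"},
                    {"start": 2, "tag": "VBP", "head": 0, "dep": "ROOT"},
                    {"start": 7, "tag": "NNP", "head": -1, "dep": "dobj"},
                    {"start": 13, "tag": ".", "head": -2, "dep": "punct"}
                ],
                "ner": [{"start": 7, "end": 13, "label": "LOC"}],
                "brackets": []
            }]
        }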

@ -1,6 +1,6 @@
//- 💫 DOCS > API > BINDER

include ../_includes/_mixins

p A container class for serializing collections of #[code Doc] objects.

5 website/api/dependencyparser.jade Normal file

@ -0,0 +1,5 @@
//- 💫 DOCS > API > DEPENDENCYPARSER

include ../_includes/_mixins

!=partial("pipe", { subclass: "DependencyParser", short: "parser", pipeline_id: "parser" })

@ -1,8 +1,6 @@
//- 💫 DOCS > API > DOC

include ../_includes/_mixins

p
    |  A #[code Doc] is a sequence of #[+api("token") #[code Token]] objects.
@ -47,7 +45,7 @@ p
            |  subsequent space. Must have the same length as #[code words], if
            |  specified. Defaults to a sequence of #[code True].

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The newly constructed object.
@ -73,7 +71,7 @@ p
        +cell int
        +cell The index of the token.

    +row("foot")
        +cell returns
        +cell #[code Token]
        +cell The token at #[code doc[i]].
@ -96,7 +94,7 @@ p
        +cell tuple
        +cell The slice of the document to get.

    +row("foot")
        +cell returns
        +cell #[code Span]
        +cell The span at #[code doc[start : end]].
@ -120,7 +118,7 @@ p
    |  from Cython.

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell yields
        +cell #[code Token]
        +cell A #[code Token] object.
@ -135,7 +133,7 @@ p Get the number of tokens in the document.
    assert len(doc) == 7

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell returns
        +cell int
        +cell The number of tokens in the document.
@ -172,7 +170,7 @@ p Create a #[code Span] object from the slice #[code doc.text[start : end]].
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
        +cell A meaning representation of the span.

    +row("foot")
        +cell returns
        +cell #[code Span]
        +cell The newly constructed object.
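
p
    |  A short sketch of #[code char_span] in use, assuming a model is loaded
    |  as #[code nlp]. It returns #[code None] if the character offsets don't
    |  map to valid token boundaries.

+aside-code("Example").
    doc = nlp(u'I like New York')
    span = doc.char_span(7, 15, label=u'GPE')
    assert span.text == u'New York'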
@ -200,7 +198,7 @@ p
            |  The object to compare with. By default, accepts #[code Doc],
            |  #[code Span], #[code Token] and #[code Lexeme] objects.

    +row("foot")
        +cell returns
        +cell float
        +cell A scalar similarity score. Higher is more similar.
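
p
    |  A minimal sketch – meaningful scores assume a model with word vectors
    |  is loaded as #[code nlp].

+aside-code("Example").
    apples = nlp(u'I like apples')
    oranges = nlp(u'I like oranges')
    similarity = apples.similarity(oranges)  # higher is more similar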
@ -226,7 +224,7 @@ p
        +cell int
        +cell The attribute ID.

    +row("foot")
        +cell returns
        +cell dict
        +cell A dictionary mapping attributes to integer counts.
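
p
    |  A quick sketch of counting attribute values, again assuming a model is
    |  loaded as #[code nlp].

+aside-code("Example").
    from spacy.attrs import ORTH

    doc = nlp(u'apple apple orange')
    counts = doc.count_by(ORTH)  # maps attribute value IDs to frequencies
    assert sum(counts.values()) == 3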
@ -251,7 +249,7 @@ p
        +cell list
        +cell A list of attribute ID ints.

    +row("foot")
        +cell returns
        +cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
        +cell
@ -285,7 +283,7 @@ p
        +cell #[code.u-break numpy.ndarray[ndim=2, dtype='int32']]
        +cell The attribute values to load.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell Itself.
@ -326,7 +324,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
            |  A path to a directory. Paths may be either strings or
            |  #[code Path]-like objects.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The modified #[code Doc] object.
| 
						 | 
					@ -341,7 +339,7 @@ p Serialize, i.e. export the document contents to a binary string.
 | 
				
			||||||
    doc_bytes = doc.to_bytes()
 | 
					    doc_bytes = doc.to_bytes()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
+table(["Name", "Type", "Description"])
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
    +footrow
 | 
					    +row("foot")
 | 
				
			||||||
        +cell returns
 | 
					        +cell returns
 | 
				
			||||||
        +cell bytes
 | 
					        +cell bytes
 | 
				
			||||||
        +cell
 | 
					        +cell
 | 
				
			||||||
| 
						 | 
					@ -367,7 +365,7 @@ p Deserialize, i.e. import the document contents from a binary string.
 | 
				
			||||||
        +cell bytes
 | 
					        +cell bytes
 | 
				
			||||||
        +cell The string to load from.
 | 
					        +cell The string to load from.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    +footrow
 | 
					    +row("foot")
 | 
				
			||||||
        +cell returns
 | 
					        +cell returns
 | 
				
			||||||
        +cell #[code Doc]
 | 
					        +cell #[code Doc]
 | 
				
			||||||
        +cell The #[code Doc] object.
 | 
					        +cell The #[code Doc] object.
 | 
				
			||||||
| 
						 | 
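Taken together, the two serialization methods above round-trip a document through its binary form. A minimal sketch, assuming a pipeline is loaded as nlp:

    # Round-trip a Doc through bytes (sketch; assumes `nlp` is a loaded pipeline)
    from spacy.tokens import Doc

    doc = nlp(u'Give it back! He pleaded.')
    doc_bytes = doc.to_bytes()
    doc2 = Doc(nlp.vocab).from_bytes(doc_bytes)
    assert doc.text == doc2.text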
@@ -378,7 +376,7 @@ p Deserialize, i.e. import the document contents from a binary string.
 p
     |  Retokenize the document, such that the span at
     |  #[code doc.text[start_idx : end_idx]] is merged into a single token. If
-    |  #[code start_idx] and #[end_idx] do not mark start and end token
+    |  #[code start_idx] and #[code end_idx] do not mark start and end token
     |  boundaries, the document remains unchanged.

 +aside-code("Example").
@@ -405,7 +403,7 @@ p
             |  attributes are inherited from the syntactic root token of
             |  the span.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell
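The merge behaviour documented above can be exercised like this; a sketch assuming a pipeline is loaded as nlp, with illustrative attribute values:

    # Merge a multi-word span into a single token in place (sketch)
    doc = nlp(u'Los Angeles starts.')
    doc.merge(0, len('Los Angeles'), tag=u'NNP', lemma=u'Los Angeles',
              ent_type=u'GPE')
    assert doc[0].text == u'Los Angeles'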
@@ -440,7 +438,7 @@ p
         +cell bool
         +cell Don't include arcs or modifiers.

-    +footrow
+    +row("foot")
         +cell returns
         +cell dict
         +cell Parse tree as dict.
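For reference, the parse-tree export documented above can be called like this; a sketch assuming nlp is loaded, and assuming the boolean flags are passed as the keyword arguments light and flat (the visible table row is the one that suppresses arcs and modifiers):

    # Export the dependency parse as nested dicts (sketch)
    doc = nlp(u'Alice ate the pizza.')
    tree = doc.print_tree(light=False, flat=False)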
@@ -462,7 +460,7 @@ p
     assert ents[0].text == 'Mr. Best'

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Span]
         +cell Entities in the document.
@@ -485,7 +483,7 @@ p
     assert chunks[1].text == "another phrase"

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Span]
         +cell Noun chunks in the document.
@@ -507,7 +505,7 @@ p
     assert [s.root.text for s in sents] == ["is", "'s"]

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Span]
         +cell Sentences in the document.
@@ -525,7 +523,7 @@ p
     assert doc.has_vector

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the document has a vector data attached.
@@ -544,7 +542,7 @@ p
     assert doc.vector.shape == (300,)

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
         +cell A 1D numpy array representing the document's semantics.
@@ -564,7 +562,7 @@ p
     assert doc1.vector_norm != doc2.vector_norm

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell The L2 norm of the vector representation.
							
								
								
									
website/api/entityrecognizer.jade (new file, 5 lines)
@@ -0,0 +1,5 @@
+//- 💫 DOCS > API > ENTITYRECOGNIZER
+
+include ../_includes/_mixins
+
+!=partial("pipe", { subclass: "EntityRecognizer", short: "ner", pipeline_id: "ner" })
@@ -1,14 +1,12 @@
 //- 💫 DOCS > API > GOLDCORPUS

-include ../../_includes/_mixins
+include ../_includes/_mixins

 p
-    |  An annotated corpus, using the JSON file format. Manages annotations for
-    |  tagging, dependency parsing and NER.
+    |  This class manages annotations for tagging, dependency parsing and NER.

 +h(2, "init") GoldCorpus.__init__
     +tag method
-    +tag-new(2)

 p Create a #[code GoldCorpus].
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > GOLDPARSE

-include ../../_includes/_mixins
+include ../_includes/_mixins

 p Collection for training annotations.
@@ -40,7 +40,7 @@ p Create a #[code GoldParse].
         +cell iterable
         +cell A sequence of named entity annotations, either as BILUO tag strings, or as #[code (start_char, end_char, label)] tuples, representing the entity positions.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code GoldParse]
         +cell The newly constructed object.
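The entities argument documented above accepts BILUO tag strings directly; a minimal sketch, assuming doc is a four-token Doc prepared elsewhere:

    # Construct a GoldParse with BILUO entity tags (sketch)
    from spacy.gold import GoldParse

    gold = GoldParse(doc, entities=['U-PERSON', 'O', 'O', 'O'])
    assert len(gold) == len(doc)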
@@ -51,7 +51,7 @@ p Create a #[code GoldParse].
 p Get the number of gold-standard tokens.

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of gold-standard tokens.
@@ -64,7 +64,7 @@ p
     |  tree.

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether annotations form projective tree.
@@ -119,7 +119,7 @@ p

 p
     |  Encode labelled spans into per-token tags, using the
-    |  #[+a("/docs/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).
+    |  #[+a("/api/annotation#biluo") BILUO scheme] (Begin/In/Last/Unit/Out).

 p
     |  Returns a list of unicode strings, describing the tags. Each tag string
@@ -157,11 +157,11 @@ p
             |  and #[code end] should be character-offset integers denoting the
             |  slice into the original string.

-    +footrow
+    +row("foot")
         +cell returns
         +cell list
         +cell
             |  Unicode strings, describing the
-            |  #[+a("/docs/api/annotation#biluo") BILUO] tags.
+            |  #[+a("/api/annotation#biluo") BILUO] tags.
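The offsets-to-BILUO helper described above can be exercised like this; a sketch assuming an English pipeline loaded as nlp, and the helper's spaCy 2 import location:

    # Encode character offsets as per-token BILUO tags (sketch)
    from spacy.gold import biluo_tags_from_offsets

    doc = nlp(u'I like London.')
    entities = [(7, 13, 'LOC')]
    tags = biluo_tags_from_offsets(doc, entities)
    assert tags == ['O', 'O', 'U-LOC', 'O']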
							
								
								
									
website/api/index.jade (new file, 14 lines)
@@ -0,0 +1,14 @@
+//- 💫 DOCS > API > ARCHITECTURE
+
+include ../_includes/_mixins
+
++section("basics")
+    include ../usage/_spacy-101/_architecture
+
++section("nn-model")
+    +h(2, "nn-model") Neural network model architecture
+    include _architecture/_nn-model
+
++section("cython")
+    +h(2, "cython") Cython conventions
+    include _architecture/_cython
@@ -1,10 +1,10 @@
 //- 💫 DOCS > API > LANGUAGE

-include ../../_includes/_mixins
+include ../_includes/_mixins

 p
-    |  A text-processing pipeline. Usually you'll load this once per process,
-    |  and pass the instance around your application.
+    |  Usually you'll load this once per process as #[code nlp] and pass the
+    |  instance around your application.

 +h(2, "init") Language.__init__
     +tag method
@@ -49,7 +49,7 @@ p Initialise a #[code Language] object.
             |  Custom meta data for the #[code Language] class. Is written to by
             |  models to add model meta data.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Language]
         +cell The newly constructed object.
@@ -77,14 +77,14 @@ p
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
+            |  #[+a("/usage/processing-pipelines#disabling") disable].

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Doc]
         +cell A container for accessing the annotations.

-+infobox("⚠️ Deprecation note")
++infobox("Deprecation note", "⚠️")
     .o-block
         |  Pipeline components to prevent from being loaded can now be added as
         |  a list to #[code disable], instead of specifying one keyword argument
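The disable list documented for the call above works like this; a sketch assuming the 'en' model is installed:

    # Skip the parser for a single call (sketch)
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'This is a sentence.', disable=['parser'])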
@@ -136,9 +136,9 @@ p
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
+            |  #[+a("/usage/processing-pipelines#disabling") disable].

-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Doc]
         +cell Documents in the order of the original text.
@@ -175,7 +175,7 @@ p Update the models in the pipeline.
         +cell callable
         +cell An optimizer.

-    +footrow
+    +row("foot")
         +cell returns
         +cell dict
         +cell Results from the update.
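A typical training step using the update method documented above; a sketch assuming train_data holds (Doc, GoldParse) pairs prepared elsewhere:

    # One pass of updates with a shared optimizer and loss tracking (sketch)
    optimizer = nlp.begin_training()
    losses = {}
    for doc, gold in train_data:
        nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)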
@@ -200,7 +200,7 @@ p
         +cell -
         +cell Config parameters.

-    +footrow
+    +row("foot")
         +cell yields
         +cell tuple
         +cell An optimizer.
@@ -242,7 +242,7 @@ p
         +cell iterable
         +cell Tuples of #[code Doc] and #[code GoldParse] objects.

-    +footrow
+    +row("foot")
         +cell yields
         +cell tuple
         +cell Tuples of #[code Doc] and #[code GoldParse] objects.
@@ -271,7 +271,7 @@ p
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
+            |  #[+a("/usage/processing-pipelines#disabling") disable]
             |  and prevent from being saved.

 +h(2, "from_disk") Language.from_disk
@@ -300,14 +300,14 @@ p
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
+            |  #[+a("/usage/processing-pipelines#disabling") disable].

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Language]
         +cell The modified #[code Language] object.

-+infobox("⚠️ Deprecation note")
++infobox("Deprecation note", "⚠️")
     .o-block
         |  As of spaCy v2.0, the #[code save_to_directory] method has been
         |  renamed to #[code to_disk], to improve consistency across classes.
@@ -332,10 +332,10 @@ p Serialize the current state to a binary string.
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable]
+            |  #[+a("/usage/processing-pipelines#disabling") disable]
             |  and prevent from being serialized.

-    +footrow
+    +row("foot")
         +cell returns
         +cell bytes
         +cell The serialized form of the #[code Language] object.
@@ -362,14 +362,14 @@ p Load state from a binary string.
         +cell list
         +cell
             |  Names of pipeline components to
-            |  #[+a("/docs/usage/language-processing-pipeline#disabling") disable].
+            |  #[+a("/usage/processing-pipelines#disabling") disable].

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Language]
         +cell The #[code Language] object.

-+infobox("⚠️ Deprecation note")
++infobox("Deprecation note", "⚠️")
     .o-block
         |  Pipeline components to prevent from being loaded can now be added as
         |  a list to #[code disable], instead of specifying one keyword argument
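The serialization methods above round-trip a whole pipeline; a sketch assuming an English pipeline is loaded as nlp:

    # Round-trip a Language object through bytes (sketch)
    import spacy

    nlp_bytes = nlp.to_bytes()
    nlp2 = spacy.blank('en')
    nlp2.from_bytes(nlp_bytes)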
							
								
								
									
website/api/lemmatizer.jade (new file, 5 lines)
@@ -0,0 +1,5 @@
+//- 💫 DOCS > API > LEMMATIZER
+
+include ../_includes/_mixins
+
++under-construction
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > LEXEME

-include ../../_includes/_mixins
+include ../_includes/_mixins

 p
     |  An entry in the vocabulary. A #[code Lexeme] has no string context – it's
@@ -24,7 +24,7 @@ p Create a #[code Lexeme] object.
         +cell int
         +cell The orth id of the lexeme.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Lexeme]
         +cell The newly constructed object.
@@ -65,7 +65,7 @@ p Check the value of a boolean flag.
         +cell int
         +cell The attribute ID of the flag to query.

-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell The value of the flag.
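The flag query documented above takes an attribute ID such as the constants in spacy.attrs; a sketch assuming an English pipeline loaded as nlp, so that stop words are set:

    # Query a built-in boolean flag on a lexeme (sketch)
    from spacy.attrs import IS_STOP

    lexeme = nlp.vocab[u'the']
    assert lexeme.check_flag(IS_STOP)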
@@ -91,7 +91,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
             |  The object to compare with. By default, accepts #[code Doc],
             |  #[code Span], #[code Token] and #[code Lexeme] objects.

-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell A scalar similarity score. Higher is more similar.
@@ -110,7 +110,7 @@ p
     assert apple.has_vector

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the lexeme has a vector data attached.
@@ -127,7 +127,7 @@ p A real-valued meaning representation.
     assert apple.vector.shape == (300,)

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
         +cell A 1D numpy array representing the lexeme's semantics.
@@ -146,7 +146,7 @@ p The L2 norm of the lexeme's vector representation.
     assert apple.vector_norm != pasta.vector_norm

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell The L2 norm of the vector representation.
@@ -1,10 +1,8 @@
 //- 💫 DOCS > API > MATCHER

-include ../../_includes/_mixins
+include ../_includes/_mixins

-p Match sequences of tokens, based on pattern rules.
-
-+infobox("⚠️ Deprecation note")
++infobox("Deprecation note", "⚠️")
     |  As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
     |  are deprecated and have been replaced with a simpler
     |  #[+api("matcher#add") #[code Matcher.add]] that lets you add a list of
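The replacement API referred to in the deprecation note takes one ID, an optional callback and any number of token-pattern lists; a minimal sketch assuming a pipeline loaded as nlp:

    # The simpler Matcher.add API (sketch)
    from spacy.matcher import Matcher

    matcher = Matcher(nlp.vocab)
    matcher.add('HELLO_WORLD', None, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
    doc = nlp(u'Hello world!')
    matches = matcher(doc)  # list of (match_id, start, end) tuples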
@@ -39,7 +37,7 @@ p Create the rule-based #[code Matcher].
         +cell dict
         +cell Patterns to add to the matcher, keyed by ID.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Matcher]
         +cell The newly constructed object.
@@ -64,7 +62,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
         +cell #[code Doc]
         +cell The document to match over.

-    +footrow
+    +row("foot")
         +cell returns
         +cell list
         +cell
@@ -81,7 +79,7 @@ p Find all token sequences matching the supplied patterns on the #[code Doc].
     |  actions per pattern within the same matcher. For example, you might only
     |  want to merge some entity types, and set custom flags for other matched
     |  patterns. For more details and examples, see the usage guide on
-    |  #[+a("/docs/usage/rule-based-matching") rule-based matching].
+    |  #[+a("/usage/linguistic-features#rule-based-matching") rule-based matching].

 +h(2, "pipe") Matcher.pipe
     +tag method
@@ -113,7 +111,7 @@ p Match a stream of documents, yielding them in turn.
             |  parallel, if the #[code Matcher] implementation supports
             |  multi-threading.

-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Doc]
         +cell Documents, in order.
@@ -134,7 +132,7 @@ p
     assert len(matcher) == 1

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of rules.
@@ -156,7 +154,8 @@ p Check whether the matcher contains rules for a match ID.
         +cell #[code key]
         +cell unicode
         +cell The match ID.
-    +footrow
+
+    +row("foot")
         +cell returns
         +cell int
         +cell Whether the matcher contains rules for this match ID.
@@ -203,7 +202,7 @@ p
             |  Match pattern. A pattern consists of a list of dicts, where each
             |  dict describes a token.

-+infobox("⚠️ Deprecation note")
++infobox("Deprecation note", "⚠️")
     .o-block
         |  As of spaCy 2.0, #[code Matcher.add_pattern] and #[code Matcher.add_entity]
         |  are deprecated and have been replaced with a simpler
@@ -257,7 +256,7 @@ p
         +cell unicode
         +cell The ID of the match rule.

-    +footrow
+    +row("foot")
         +cell returns
         +cell tuple
         +cell The rule, as an #[code (on_match, patterns)] tuple.
							
								
								
									
website/api/phrasematcher.jade (new file, 181 lines)
@@ -0,0 +1,181 @@
+//- 💫 DOCS > API > PHRASEMATCHER
+
+include ../_includes/_mixins
+
+p
+    |  The #[code PhraseMatcher] lets you efficiently match large terminology
+    |  lists. While the #[+api("matcher") #[code Matcher]] lets you match
+    |  sequences based on lists of token descriptions, the #[code PhraseMatcher]
+    |  accepts match patterns in the form of #[code Doc] objects.
+
++h(2, "init") PhraseMatcher.__init__
+    +tag method
+
+p Create the rule-based #[code PhraseMatcher].
+
++aside-code("Example").
+    from spacy.matcher import PhraseMatcher
+    matcher = PhraseMatcher(nlp.vocab, max_length=6)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell
+            |  The vocabulary object, which must be shared with the documents
+            |  the matcher will operate on.
+
+    +row
+        +cell #[code max_length]
+        +cell int
+        +cell Maximum length of a phrase pattern to add.
+
+    +row("foot")
+        +cell returns
+        +cell #[code PhraseMatcher]
+        +cell The newly constructed object.
+
++h(2, "call") PhraseMatcher.__call__
+    +tag method
+
+p Find all token sequences matching the supplied patterns on the #[code Doc].
+
++aside-code("Example").
+    from spacy.matcher import PhraseMatcher
+
+    matcher = PhraseMatcher(nlp.vocab)
+    matcher.add('OBAMA', None, nlp(u"Barack Obama"))
+    doc = nlp(u"Barack Obama lifts America one last time in emotional farewell")
+    matches = matcher(doc)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code doc]
+        +cell #[code Doc]
+        +cell The document to match over.
+
+    +row("foot")
+        +cell returns
+        +cell list
+        +cell
+            |  A list of #[code (match_id, start, end)] tuples, describing the
+            |  matches. A match tuple describes a span #[code doc[start:end]].
+            |  The #[code match_id] is the ID of the added match pattern.
+
++h(2, "pipe") PhraseMatcher.pipe
+    +tag method
+
+p Match a stream of documents, yielding them in turn.
+
++aside-code("Example").
+    from spacy.matcher import PhraseMatcher
+    matcher = PhraseMatcher(nlp.vocab)
+    for doc in matcher.pipe(texts, batch_size=50, n_threads=4):
+        pass
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code docs]
+        +cell iterable
+        +cell A stream of documents.
+
+    +row
+        +cell #[code batch_size]
+        +cell int
+        +cell The number of documents to accumulate into a working set.
+
+    +row
+        +cell #[code n_threads]
+        +cell int
+        +cell
+            |  The number of threads with which to work on the buffer in
+            |  parallel, if the #[code PhraseMatcher] implementation supports
+            |  multi-threading.
+
+    +row("foot")
+        +cell yields
+        +cell #[code Doc]
+        +cell Documents, in order.
+
++h(2, "len") PhraseMatcher.__len__
+    +tag method
+
+p
+    |  Get the number of rules added to the matcher. Note that this only returns
+    |  the number of rules (identical with the number of IDs), not the number
+    |  of individual patterns.
+
++aside-code("Example").
+    matcher = PhraseMatcher(nlp.vocab)
+    assert len(matcher) == 0
+    matcher.add('OBAMA', None, nlp(u"Barack Obama"))
+    assert len(matcher) == 1
+
++table(["Name", "Type", "Description"])
+    +row("foot")
+        +cell returns
+        +cell int
+        +cell The number of rules.
+
++h(2, "contains") PhraseMatcher.__contains__
+    +tag method
+
+p Check whether the matcher contains rules for a match ID.
+
++aside-code("Example").
+    matcher = PhraseMatcher(nlp.vocab)
+    assert len(matcher) == 0
+    matcher.add('OBAMA', None, nlp(u"Barack Obama"))
+    assert len(matcher) == 1
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code key]
+        +cell unicode
+        +cell The match ID.
+
+    +row("foot")
+        +cell returns
+        +cell int
+        +cell Whether the matcher contains rules for this match ID.
+
++h(2, "add") PhraseMatcher.add
+    +tag method
+
+p
+    |  Add a rule to the matcher, consisting of an ID key, one or more patterns, and
+    |  a callback function to act on the matches. The callback function will
+    |  receive the arguments #[code matcher], #[code doc], #[code i] and
+    |  #[code matches]. If a pattern already exists for the given ID, the
+    |  patterns will be extended. An #[code on_match] callback will be
+    |  overwritten.
+
++aside-code("Example").
+    def on_match(matcher, doc, id, matches):
+        print('Matched!', matches)
+
+    matcher = PhraseMatcher(nlp.vocab)
+    matcher.add('OBAMA', on_match, nlp(u"Barack Obama"))
+    matcher.add('HEALTH', on_match, nlp(u"health care reform"),
+                                    nlp(u"healthcare reform"))
+    doc = nlp(u"Barack Obama urges Congress to find courage to defend his healthcare reforms")
+    matches = matcher(doc)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code match_id]
+        +cell unicode
+        +cell An ID for the thing you're matching.
+
+    +row
+        +cell #[code on_match]
+        +cell callable or #[code None]
+        +cell
+            |  Callback function to act on matches. Takes the arguments
+            |  #[code matcher], #[code doc], #[code i] and #[code matches].
+
+    +row
+        +cell #[code *docs]
+        +cell list
+        +cell
+            |  #[code Doc] objects of the phrases to match.
							
								
								
									
website/api/pipe.jade (new file, 390 lines)
@@ -0,0 +1,390 @@
+//- 💫 DOCS > API > PIPE
+
+include ../_includes/_mixins
+
+//- This page can be used as a template for all other classes that inherit
+//-  from `Pipe`.
+
+if subclass
+    +infobox
+        |  This class is a subclass of #[+api("pipe") #[code Pipe]] and
+        |  follows the same API. The pipeline component is available in the
+        |  #[+a("/usage/processing-pipelines") processing pipeline] via the ID
+        |  #[code "#{pipeline_id}"].
+
+else
+    p
+        |  This class is not instantiated directly. Components inherit from it,
+        |  and it defines the interface that components should follow to
+        |  function as components in a spaCy analysis pipeline.
+
+- CLASSNAME = subclass || 'Pipe'
+- VARNAME = short || CLASSNAME.toLowerCase()
+
+
++h(2, "model") #{CLASSNAME}.Model
+    +tag classmethod
+
+p
+    |  Initialise a model for the pipe. The model should implement the
+    |  #[code thinc.neural.Model] API. Wrappers are available for
+    |  #[+a("/usage/deep-learning") most major machine learning libraries].
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code **kwargs]
+        +cell -
+        +cell Parameters for initialising the model
+
+    +row("foot")
+        +cell returns
+        +cell object
+        +cell The initialised model.
+
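The Model/predict/set_annotations contract described above implies component subclasses along these lines; a hypothetical sketch, not part of the committed docs:

    # Hypothetical minimal component following the Pipe interface (sketch)
    from spacy.pipeline import Pipe

    class MyComponent(Pipe):
        name = 'my_component'  # hypothetical pipeline ID

        def predict(self, docs):
            # score a batch of docs without modifying them
            return [0.0 for _ in docs]

        def set_annotations(self, docs, scores):
            # write the pre-computed scores back onto the docs
            for doc, score in zip(docs, scores):
                doc.user_data['my_component_score'] = score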
++h(2, "init") #{CLASSNAME}.__init__
+    +tag method
+
+p Create a new pipeline instance.
+
++aside-code("Example").
+    from spacy.pipeline import #{CLASSNAME}
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell The shared vocabulary.
+
+    +row
+        +cell #[code model]
+        +cell #[code thinc.neural.Model] or #[code True]
+        +cell
+            |  The model powering the pipeline component. If no model is
+            |  supplied, the model is created when you call
+            |  #[code begin_training], #[code from_disk] or #[code from_bytes].
+
+    +row
+        +cell #[code **cfg]
+        +cell -
+        +cell Configuration parameters.
+
+    +row("foot")
+        +cell returns
+        +cell #[code=CLASSNAME]
+        +cell The newly constructed object.
+
++h(2, "call") #{CLASSNAME}.__call__
+    +tag method
+
+p
+    |  Apply the pipe to one document. The document is modified in place, and
+    |  returned. Both #[code #{CLASSNAME}.__call__] and
+    |  #[code #{CLASSNAME}.pipe] should delegate to the
+    |  #[code #{CLASSNAME}.predict] and #[code #{CLASSNAME}.set_annotations]
+    |  methods.
+
++aside-code("Example").
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+    doc = nlp(u"This is a sentence.")
+    processed = #{VARNAME}(doc)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code doc]
+        +cell #[code Doc]
+        +cell The document to process.
+
+    +row("foot")
+        +cell returns
+        +cell #[code Doc]
+        +cell The processed document.
+
++h(2, "pipe") #{CLASSNAME}.pipe
+    +tag method
+
+p
+    |  Apply the pipe to a stream of documents. Both
+    |  #[code #{CLASSNAME}.__call__] and #[code #{CLASSNAME}.pipe] should
+    |  delegate to the #[code #{CLASSNAME}.predict] and
+    |  #[code #{CLASSNAME}.set_annotations] methods.
+
++aside-code("Example").
+    texts = [u'One doc', u'...', u'Lots of docs']
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+    for doc in #{VARNAME}.pipe(texts, batch_size=50):
+        pass
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code stream]
+        +cell iterable
+        +cell A stream of documents.
+
+    +row
+        +cell #[code batch_size]
+        +cell int
+        +cell The number of texts to buffer. Defaults to #[code 128].
+
+    +row
+        +cell #[code n_threads]
+        +cell int
+        +cell
+            |  The number of worker threads to use. If #[code -1], OpenMP will
+            |  decide how many to use at run time. Default is #[code -1].
+
+    +row("foot")
+        +cell yields
+        +cell #[code Doc]
+        +cell Processed documents in the order of the original text.
+
++h(2, "predict") #{CLASSNAME}.predict
+    +tag method
+
+p
+    |  Apply the pipeline's model to a batch of docs, without modifying them.
+
++aside-code("Example").
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+    scores = #{VARNAME}.predict([doc1, doc2])
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code docs]
+        +cell iterable
+        +cell The documents to predict.
+
+    +row("foot")
+        +cell returns
+        +cell -
+        +cell Scores from the model.
+
++h(2, "set_annotations") #{CLASSNAME}.set_annotations
+    +tag method
+
+p
+    |  Modify a batch of documents, using pre-computed scores.
+
++aside-code("Example").
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+    scores = #{VARNAME}.predict([doc1, doc2])
+    #{VARNAME}.set_annotations([doc1, doc2], scores)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code docs]
+        +cell iterable
+        +cell The documents to modify.
+
+    +row
+        +cell #[code scores]
+        +cell -
+        +cell The scores to set, produced by #[code #{CLASSNAME}.predict].
+
++h(2, "update") #{CLASSNAME}.update
+    +tag method
+
+p
+    |  Learn from a batch of documents and gold-standard information, updating
+    |  the pipe's model. Delegates to #[code #{CLASSNAME}.predict] and
+    |  #[code #{CLASSNAME}.get_loss].
+
++aside-code("Example").
+    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
+    losses = {}
+    optimizer = nlp.begin_training()
+    #{VARNAME}.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code docs]
+        +cell iterable
+        +cell A batch of documents to learn from.
+
+    +row
+        +cell #[code golds]
+        +cell iterable
+        +cell The gold-standard data. Must have the same length as #[code docs].
+
+    +row
+        +cell #[code drop]
+        +cell int
 | 
				
			||||||
 | 
					        +cell The dropout rate.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code sgd]
 | 
				
			||||||
 | 
					        +cell callable
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  The optimizer. Should take two arguments #[code weights] and
 | 
				
			||||||
 | 
					            |  #[code gradient], and an optional ID.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code losses]
 | 
				
			||||||
 | 
					        +cell dict
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  Optional record of the loss during training. The value keyed by
 | 
				
			||||||
 | 
					            |  the model's name is updated.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
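p
    |  For reference, a minimal sketch of a callable matching the
    |  #[code sgd] signature described above. The plain gradient-descent
    |  step and the learning rate are illustrative, not spaCy's default
    |  optimizer:

+aside-code("Optimizer sketch").
    def sgd(weights, gradient, key=None):
        # Vanilla gradient descent: update the weights in place.
        weights -= 0.001 * gradient
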
					+h(2, "get_loss") #{CLASSNAME}.get_loss
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p
 | 
				
			||||||
 | 
					    |  Find the loss and gradient of loss for the batch of documents and their
 | 
				
			||||||
 | 
					    |  predicted scores.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    scores = #{VARNAME}.predict([doc1, doc2])
 | 
				
			||||||
 | 
					    loss, d_loss = #{VARNAME}.get_loss([doc1, doc2], [gold1, gold2], scores)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code docs]
 | 
				
			||||||
 | 
					        +cell iterable
 | 
				
			||||||
 | 
					        +cell The batch of documents.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code golds]
 | 
				
			||||||
 | 
					        +cell iterable
 | 
				
			||||||
 | 
					        +cell The gold-standard data. Must have the same length as #[code docs].
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code scores]
 | 
				
			||||||
 | 
					        +cell -
 | 
				
			||||||
 | 
					        +cell Scores representing the model's predictions.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row("foot")
 | 
				
			||||||
 | 
					        +cell returns
 | 
				
			||||||
 | 
					        +cell tuple
 | 
				
			||||||
 | 
					        +cell The loss and the gradient, i.e. #[code (loss, gradient)].
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "begin_training") #{CLASSNAME}.begin_training
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p
 | 
				
			||||||
 | 
					    |  Initialize the pipe for training, using data exampes if available. If no
 | 
				
			||||||
 | 
					    |  model has been initialized yet, the model is added.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    nlp.pipeline.append(#{VARNAME})
 | 
				
			||||||
 | 
					    #{VARNAME}.begin_training(pipeline=nlp.pipeline)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code gold_tuples]
 | 
				
			||||||
 | 
					        +cell iterable
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  Optional gold-standard annotations from which to construct
 | 
				
			||||||
 | 
					            |  #[+api("goldparse") #[code GoldParse]] objects.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code pipeline]
 | 
				
			||||||
 | 
					        +cell list
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  Optional list of #[+api("pipe") #[code Pipe]] components that
 | 
				
			||||||
 | 
					            |  this component is part of.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "use_params") #{CLASSNAME}.use_params
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					    +tag contextmanager
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p Modify the pipe's model, to use the given parameter values.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    with #{VARNAME}.use_params():
 | 
				
			||||||
 | 
					        #{VARNAME}.to_disk('/best_model')
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code params]
 | 
				
			||||||
 | 
					        +cell -
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  The parameter values to use in the model. At the end of the
 | 
				
			||||||
 | 
					            |  context, the original parameters are restored.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "to_disk") #{CLASSNAME}.to_disk
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p Serialize the pipe to disk.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    #{VARNAME}.to_disk('/path/to/#{VARNAME}')
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code path]
 | 
				
			||||||
 | 
					        +cell unicode or #[code Path]
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  A path to a directory, which will be created if it doesn't exist.
 | 
				
			||||||
 | 
					            |  Paths may be either strings or #[code Path]-like objects.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "from_disk") #{CLASSNAME}.from_disk
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p Load the pipe from disk. Modifies the object in place and returns it.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    #{VARNAME}.from_disk('/path/to/#{VARNAME}')
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code path]
 | 
				
			||||||
 | 
					        +cell unicode or #[code Path]
 | 
				
			||||||
 | 
					        +cell
 | 
				
			||||||
 | 
					            |  A path to a directory. Paths may be either strings or
 | 
				
			||||||
 | 
					            |  #[code Path]-like objects.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row("foot")
 | 
				
			||||||
 | 
					        +cell returns
 | 
				
			||||||
 | 
					        +cell #[code=CLASSNAME]
 | 
				
			||||||
 | 
					        +cell The modified #[code=CLASSNAME] object.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "to_bytes") #{CLASSNAME}.to_bytes
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("example").
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    #{VARNAME}_bytes = #{VARNAME}.to_bytes()
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p Serialize the pipe to a bytestring.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code **exclude]
 | 
				
			||||||
 | 
					        +cell -
 | 
				
			||||||
 | 
					        +cell Named attributes to prevent from being serialized.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row("foot")
 | 
				
			||||||
 | 
					        +cell returns
 | 
				
			||||||
 | 
					        +cell bytes
 | 
				
			||||||
 | 
					        +cell The serialized form of the #[code=CLASSNAME] object.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+h(2, "from_bytes") #{CLASSNAME}.from_bytes
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p Load the pipe from a bytestring. Modifies the object in place and returns it.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    #{VARNAME}_bytes = #{VARNAME}.to_bytes()
 | 
				
			||||||
 | 
					    #{VARNAME} = #{CLASSNAME}(nlp.vocab)
 | 
				
			||||||
 | 
					    #{VARNAME}.from_bytes(#{VARNAME}_bytes)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+table(["Name", "Type", "Description"])
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code bytes_data]
 | 
				
			||||||
 | 
					        +cell bytes
 | 
				
			||||||
 | 
					        +cell The data to load from.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row
 | 
				
			||||||
 | 
					        +cell #[code **exclude]
 | 
				
			||||||
 | 
					        +cell -
 | 
				
			||||||
 | 
					        +cell Named attributes to prevent from being loaded.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    +row("foot")
 | 
				
			||||||
 | 
					        +cell returns
 | 
				
			||||||
 | 
					        +cell #[code=CLASSNAME]
 | 
				
			||||||
 | 
					        +cell The #[code=CLASSNAME] object.
 | 
				
			||||||
| 
						 | 
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > SPAN
 
-include ../../_includes/_mixins
+include ../_includes/_mixins
 
 p A slice from a #[+api("doc") #[code Doc]] object.
 
@@ -40,7 +40,7 @@ p Create a Span object from the #[code slice doc[start : end]].
         +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
         +cell A meaning representation of the span.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Span]
         +cell The newly constructed object.
@@ -61,7 +61,7 @@ p Get a #[code Token] object.
         +cell int
         +cell The index of the token within the span.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell The token at #[code span[i]].
@@ -79,7 +79,7 @@ p Get a #[code Span] object.
         +cell tuple
         +cell The slice of the span to get.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Span]
         +cell The span at #[code span[start : end]].
@@ -95,7 +95,7 @@ p Iterate over #[code Token] objects.
     assert [t.text for t in span] == ['it', 'back', '!']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A #[code Token] object.
@@ -111,7 +111,7 @@ p Get the number of tokens in the span.
     assert len(span) == 3
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of tokens in the span.
@@ -140,7 +140,7 @@ p
             |  The object to compare with. By default, accepts #[code Doc],
             |  #[code Span], #[code Token] and #[code Lexeme] objects.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell A scalar similarity score. Higher is more similar.
@@ -167,7 +167,7 @@ p
         +cell list
         +cell A list of attribute ID ints.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code.u-break numpy.ndarray[long, ndim=2]]
         +cell
@@ -194,7 +194,7 @@ p Retokenize the document, such that the span is merged into a single token.
             |  Attributes to assign to the merged token. By default, attributes
             |  are inherited from the syntactic root token of the span.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell The newly merged token.
@@ -216,7 +216,7 @@ p
     assert new_york.root.text == 'York'
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell The root token.
@@ -233,7 +233,7 @@ p Tokens that are to the left of the span, whose head is within the span.
     assert lefts == [u'New']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A left-child of a token of the span.
@@ -250,7 +250,7 @@ p Tokens that are to the right of the span, whose head is within the span.
     assert rights == [u'in']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A right-child of a token of the span.
@@ -267,7 +267,7 @@ p Tokens that descend from tokens in the span, but fall outside it.
     assert subtree == [u'Give', u'it', u'back', u'!']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A descendant of a token within the span.
@@ -285,7 +285,7 @@ p
     assert doc[1:].has_vector
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the span has vector data attached.
@@ -304,7 +304,7 @@ p
     assert doc[1:].vector.shape == (300,)
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
         +cell A 1D numpy array representing the span's semantics.
@@ -323,7 +323,7 @@ p
     assert doc[1:].vector_norm != doc[2:].vector_norm
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell The L2 norm of the vector representation.
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > STRINGSTORE
 
-include ../../_includes/_mixins
+include ../_includes/_mixins
 
 p
     |  Look up strings by 64-bit hashes. As of v2.0, spaCy uses hash values
@@ -23,7 +23,7 @@ p
         +cell iterable
         +cell A sequence of unicode strings to add to the store.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code StringStore]
         +cell The newly constructed object.
@@ -38,7 +38,7 @@ p Get the number of strings in the store.
     assert len(stringstore) == 2
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of strings in the store.
@@ -60,7 +60,7 @@ p Retrieve a string from a given hash, or vice versa.
         +cell bytes, unicode or uint64
         +cell The value to encode.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell unicode or int
         +cell The value to be retrieved.
@@ -81,7 +81,7 @@ p Check whether a string is in the store.
         +cell unicode
         +cell The string to check.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the store contains the string.
@@ -100,7 +100,7 @@ p
     assert all_strings == [u'apple', u'orange']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell unicode
         +cell A string in the store.
@@ -125,7 +125,7 @@ p Add a string to the #[code StringStore].
         +cell unicode
         +cell The string to add.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell uint64
         +cell The string's hash value.
@@ -166,7 +166,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
             |  A path to a directory. Paths may be either strings or
             |  #[code Path]-like objects.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code StringStore]
         +cell The modified #[code StringStore] object.
@@ -185,7 +185,7 @@ p Serialize the current state to a binary string.
         +cell -
         +cell Named attributes to prevent from being serialized.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell bytes
         +cell The serialized form of the #[code StringStore] object.
@@ -211,7 +211,7 @@ p Load state from a binary string.
         +cell -
         +cell Named attributes to prevent from being loaded.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code StringStore]
         +cell The #[code StringStore] object.
@@ -233,7 +233,7 @@ p Get a 64-bit hash for a given string.
         +cell unicode
         +cell The string to hash.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell uint64
         +cell The hash.
5 website/api/tagger.jade Normal file
@@ -0,0 +1,5 @@
//- 💫 DOCS > API > TAGGER

include ../_includes/_mixins

!=partial("pipe", { subclass: "Tagger", pipeline_id: "tagger" })
5 website/api/tensorizer.jade Normal file
@@ -0,0 +1,5 @@
//- 💫 DOCS > API > TENSORIZER

include ../_includes/_mixins

!=partial("pipe", { subclass: "Tensorizer", pipeline_id: "tensorizer" })
19 website/api/textcategorizer.jade Normal file
@@ -0,0 +1,19 @@
//- 💫 DOCS > API > TEXTCATEGORIZER

include ../_includes/_mixins

p
    |  The model supports classification with multiple, non-mutually exclusive
    |  labels. You can change the model architecture rather easily, but by
    |  default, the #[code TextCategorizer] class uses a convolutional
    |  neural network to assign position-sensitive vectors to each word in the
    |  document. This step is similar to the #[+api("tensorizer") #[code Tensorizer]]
    |  component, but the #[code TextCategorizer] uses its own CNN model, to
    |  avoid sharing weights with the other pipeline components. The document
    |  tensor is then summarized by concatenating max and mean pooling, and a
    |  multilayer perceptron is used to predict an output vector of length
    |  #[code nr_class], before a logistic activation is applied elementwise.
    |  The value of each output neuron is the probability that some class is
    |  present.

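p
    |  The pooling step described above can be sketched in a few lines of
    |  numpy. This illustrates the idea only; the shapes, random weights and
    |  #[code nr_class] value are assumptions, not the actual implementation:

+aside-code("Pooling sketch").
    import numpy
    # One position-sensitive vector per token, e.g. 7 tokens x width 64.
    doc_tensor = numpy.random.uniform(-1, 1, (7, 64))
    # Concatenate max and mean pooling into a fixed-width summary vector.
    summary = numpy.concatenate([doc_tensor.max(axis=0), doc_tensor.mean(axis=0)])
    # A multilayer perceptron maps the summary to nr_class logits; the
    # logistic activation makes each output an independent probability.
    logits = numpy.dot(summary, numpy.random.uniform(-1, 1, (128, 3)))
    probs = 1. / (1. + numpy.exp(-logits))
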
!=partial("pipe", { subclass: "TextCategorizer", short: "textcat", pipeline_id: "textcat" })
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > TOKEN
 
-include ../../_includes/_mixins
+include ../_includes/_mixins
 
 p An individual token — i.e. a word, punctuation symbol, whitespace, etc.
 
@@ -30,7 +30,7 @@ p Construct a #[code Token] object.
         +cell int
         +cell The index of the token within the document.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell The newly constructed object.
@@ -46,7 +46,7 @@ p The number of unicode characters in the token, i.e. #[code token.text].
     assert len(token) == 4
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of unicode characters in the token.
@@ -68,7 +68,7 @@ p Check the value of a boolean flag.
         +cell int
         +cell The attribute ID of the flag to check.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the flag is set.
@@ -93,7 +93,7 @@ p Compute a semantic similarity estimate. Defaults to cosine over vectors.
             |  The object to compare with. By default, accepts #[code Doc],
             |  #[code Span], #[code Token] and #[code Lexeme] objects.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell A scalar similarity score. Higher is more similar.
@@ -114,7 +114,7 @@ p Get a neighboring token.
         +cell int
         +cell The relative position of the token to get. Defaults to #[code 1].
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Token]
         +cell The token at position #[code self.doc[self.i+i]].
@@ -139,7 +139,7 @@ p
         +cell #[code Token]
         +cell Another token.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether this token is the ancestor of the descendant.
@@ -158,7 +158,7 @@ p The rightmost token of this token's syntactic descendants.
     assert [t.text for t in he_ancestors] == [u'pleaded']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell
@@ -177,7 +177,7 @@ p A sequence of coordinated tokens, including the token itself.
     assert [t.text for t in apples_conjuncts] == [u'oranges']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A coordinated token.
@@ -194,7 +194,7 @@ p A sequence of the token's immediate syntactic children.
     assert [t.text for t in give_children] == [u'it', u'back', u'!']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A child token such that #[code child.head==self].
@@ -211,7 +211,7 @@ p A sequence of all the token's syntactic descendants.
     assert [t.text for t in give_subtree] == [u'Give', u'it', u'back', u'!']
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Token]
         +cell A descendant token such that #[code self.is_ancestor(descendant)].
@@ -230,7 +230,7 @@ p
     assert apples.has_vector
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the token has vector data attached.
@@ -248,7 +248,7 @@ p A real-valued meaning representation.
     assert apples.vector.shape == (300,)
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
         +cell A 1D numpy array representing the token's semantics.
@@ -268,7 +268,7 @@ p The L2 norm of the token's vector representation.
     assert apples.vector_norm != pasta.vector_norm
 
 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell float
         +cell The L2 norm of the vector representation.
@@ -280,20 +280,29 @@ p The L2 norm of the token's vector representation.
         +cell #[code text]
         +cell unicode
         +cell Verbatim text content.
 
     +row
         +cell #[code text_with_ws]
         +cell unicode
         +cell Text content, with trailing space character if present.
 
-    +row
-        +cell #[code whitespace]
-        +cell int
-        +cell Trailing space character if present.
     +row
         +cell #[code whitespace_]
         +cell unicode
         +cell Trailing space character if present.
+
+    +row
+        +cell #[code orth]
+        +cell int
+        +cell ID of the verbatim text content.
+
+    +row
+        +cell #[code orth_]
+        +cell unicode
+        +cell
+            |  Verbatim text content (identical to #[code Token.text]). Exists
+            |  mostly for consistency with the other attributes.
+
     +row
         +cell #[code vocab]
         +cell #[code Vocab]
@@ -1,6 +1,6 @@
 //- 💫 DOCS > API > TOKENIZER
 
-include ../../_includes/_mixins
+include ../_includes/_mixins
 
 p
     |  Segment text, and create #[code Doc] objects with the discovered segment
@@ -57,7 +57,7 @@ p Create a #[code Tokenizer], to create #[code Doc] objects given unicode text.
         +cell callable
         +cell A boolean function matching strings to be recognised as tokens.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Tokenizer]
         +cell The newly constructed object.
@@ -77,7 +77,7 @@ p Tokenize a string.
         +cell unicode
         +cell The string to tokenize.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Doc]
         +cell A container for linguistic annotations.
@@ -110,7 +110,7 @@ p Tokenize a stream of texts.
             |  The number of threads to use, if the implementation supports
             |  multi-threading. The default tokenizer is single-threaded.
 
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Doc]
         +cell A sequence of Doc objects, in order.
@@ -126,7 +126,7 @@ p Find internal split points of the string.
         +cell unicode
         +cell The string to split.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell list
         +cell
@@ -147,7 +147,7 @@ p
         +cell unicode
         +cell The string to segment.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The length of the prefix if present, otherwise #[code None].
@@ -165,7 +165,7 @@ p
         +cell unicode
         +cell The string to segment.
 
-    +footrow
+    +row("foot")
         +cell returns
         +cell int / #[code None]
         +cell The length of the suffix if present, otherwise #[code None].
@@ -176,7 +176,7 @@ p
 p
     |  Add a special-case tokenization rule. This mechanism is also used to add
     |  custom tokenizer exceptions to the language data. See the usage guide
-    |  on #[+a("/docs/usage/adding-languages#tokenizer-exceptions") adding languages]
+    |  on #[+a("/usage/adding-languages#tokenizer-exceptions") adding languages]
     |  for more details and examples.
 
 +aside-code("Example").
24 website/api/top-level.jade Normal file
@@ -0,0 +1,24 @@
//- 💫 DOCS > API > TOP-LEVEL

include ../_includes/_mixins

+section("spacy")
    //-+h(2, "spacy") spaCy
    //- spacy/__init__.py
    include _top-level/_spacy

+section("displacy")
    +h(2, "displacy", "spacy/displacy") displaCy
    include _top-level/_displacy

+section("util")
    +h(2, "util", "spacy/util.py") Utility functions
    include _top-level/_util

+section("compat")
    +h(2, "compat", "spacy/compat.py") Compatibility functions
    include _top-level/_compat

+section("cli", "spacy/cli")
    +h(2, "cli") Command line
    include _top-level/_cli
333 website/api/vectors.jade Normal file
@@ -0,0 +1,333 @@
//- 💫 DOCS > API > VECTORS

include ../_includes/_mixins

p
    |  Vectors data is kept in the #[code Vectors.data] attribute, which should
    |  be an instance of #[code numpy.ndarray] (for CPU vectors) or
    |  #[code cupy.ndarray] (for GPU vectors).

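p
    |  For orientation, the table really is just a 2D array, one row per
    |  entry; a minimal sketch of inspecting it (CPU case, shapes
    |  illustrative):

+aside-code("Data layout sketch").
    from spacy.vectors import Vectors
    from spacy.strings import StringStore
    import numpy

    vectors = Vectors(StringStore(), numpy.zeros((3, 300), dtype='f'))
    assert vectors.data.shape == (3, 300)
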
					+h(2, "init") Vectors.__init__
 | 
				
			||||||
 | 
					    +tag method
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					p
 | 
				
			||||||
 | 
					    |  Create a new vector store. To keep the vector table empty, pass
 | 
				
			||||||
 | 
					    |  #[code data_or_width=0]. You can also create the vector table and add
 | 
				
			||||||
 | 
					    |  vectors one by one, or set the vector values directly on initialisation.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					+aside-code("Example").
 | 
				
			||||||
 | 
					    from spacy.vectors import Vectors
 | 
				
			||||||
 | 
					    from spacy.strings import StringStore
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    empty_vectors = Vectors(StringStore())
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    vectors = Vectors([u'cat'], 300)
 | 
				
			||||||
 | 
					    vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    vector_table = numpy.zeros((3, 300), dtype='f')
 | 
				
			||||||
 | 
					    vectors = Vectors(StringStore(), vector_table)

+table(["Name", "Type", "Description"])
    +row
        +cell #[code strings]
        +cell #[code StringStore] or list
        +cell
            |  List of strings, or a #[+api("stringstore") #[code StringStore]]
            |  that maps strings to hash values, and vice versa.

    +row
        +cell #[code data_or_width]
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']] or int
        +cell Vector data or number of dimensions.

    +row("foot")
        +cell returns
        +cell #[code Vectors]
        +cell The newly created object.

+h(2, "getitem") Vectors.__getitem__
    +tag method

p
    |  Get a vector by key. If key is a string, it is hashed to an integer ID
    |  using the #[code Vectors.strings] table. If the integer key is not found
    |  in the table, a #[code KeyError] is raised.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
    cat_vector = vectors[u'cat']

+table(["Name", "Type", "Description"])
    +row
        +cell #[code key]
        +cell unicode / int
        +cell The key to get the vector for.

    +row("foot")
        +cell returns
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
        +cell The vector for the key.

+h(2, "setitem") Vectors.__setitem__
    +tag method

p
    |  Set a vector for the given key. If key is a string, it is hashed to an
    |  integer ID using the #[code Vectors.strings] table.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors[u'cat'] = numpy.random.uniform(-1, 1, (300,))

+table(["Name", "Type", "Description"])
    +row
        +cell #[code key]
        +cell unicode / int
        +cell The key to set the vector for.

    +row
        +cell #[code vector]
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
        +cell The vector to set.

+h(2, "iter") Vectors.__iter__
    +tag method

p Yield vectors from the table.

+aside-code("Example").
    vector_table = numpy.zeros((3, 300), dtype='f')
    vectors = Vectors(StringStore(), vector_table)
    for vector in vectors:
        print(vector)

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell yields
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
        +cell A vector from the table.

+h(2, "len") Vectors.__len__
    +tag method

p Return the number of vectors that have been assigned.

+aside-code("Example").
    vector_table = numpy.zeros((3, 300), dtype='f')
    vectors = Vectors(StringStore(), vector_table)
    assert len(vectors) == 3

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell returns
        +cell int
        +cell The number of vectors in the data.

+h(2, "contains") Vectors.__contains__
    +tag method

p
    |  Check whether a key has a vector entry in the table. If key is a string,
    |  it is hashed to an integer ID using the #[code Vectors.strings] table.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
    assert u'cat' in vectors

+table(["Name", "Type", "Description"])
    +row
        +cell #[code key]
        +cell unicode / int
        +cell The key to check.

    +row("foot")
        +cell returns
        +cell bool
        +cell Whether the key has a vector entry.

+h(2, "add") Vectors.add
    +tag method

p
    |  Add a key to the table, optionally setting a vector value as well. If
    |  key is a string, it is hashed to an integer ID using the
    |  #[code Vectors.strings] table.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))

+table(["Name", "Type", "Description"])
    +row
        +cell #[code key]
        +cell unicode / int
        +cell The key to add.

    +row
        +cell #[code vector]
        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
        +cell An optional vector to add.

+h(2, "items") Vectors.items
    +tag method

p Iterate over #[code (string key, vector)] pairs, in order.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
    for key, vector in vectors.items():
        print(key, vector)

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell yields
        +cell tuple
        +cell #[code (string key, vector)] pairs, in order.

+h(2, "shape") Vectors.shape
    +tag property

p
    |  Get the #[code (rows, dims)] tuple of the number of rows and number of
    |  dimensions in the vector table.

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
    rows, dims = vectors.shape
    assert rows == 1
    assert dims == 300

+table(["Name", "Type", "Description"])
    +row("foot")
        +cell returns
        +cell tuple
        +cell The #[code (rows, dims)] tuple.

+h(2, "from_glove") Vectors.from_glove
    +tag method

p
    |  Load #[+a("https://nlp.stanford.edu/projects/glove/") GloVe] vectors from
    |  a directory. Assumes the binary format, with the vocab in a
    |  #[code vocab.txt] file and the vectors in files named
    |  #[code vectors.{size}.[fd].bin], e.g. #[code vectors.128.f.bin] for 128d
    |  float32 vectors, #[code vectors.300.d.bin] for 300d float64 (double)
    |  vectors, etc. By default GloVe outputs 64-bit vectors.
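
//- A minimal usage sketch, assuming a directory that follows the layout
//- described above; the path and file names here are placeholders.
+aside-code("Example").
    # assumes /path/to/glove_vectors contains vocab.txt and vectors.128.f.bin
    vectors = Vectors(StringStore())
    vectors.from_glove('/path/to/glove_vectors')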

+table(["Name", "Type", "Description"])
    +row
        +cell #[code path]
        +cell unicode / #[code Path]
        +cell The path to load the GloVe vectors from.

+h(2, "to_disk") Vectors.to_disk
    +tag method

p Save the current state to a directory.

+aside-code("Example").
    vectors.to_disk('/path/to/vectors')

+table(["Name", "Type", "Description"])
    +row
        +cell #[code path]
        +cell unicode or #[code Path]
        +cell
            |  A path to a directory, which will be created if it doesn't exist.
            |  Paths may be either strings or #[code Path]-like objects.

+h(2, "from_disk") Vectors.from_disk
    +tag method

p Load state from a directory. Modifies the object in place and returns it.

+aside-code("Example").
    vectors = Vectors(StringStore())
    vectors.from_disk('/path/to/vectors')

+table(["Name", "Type", "Description"])
    +row
        +cell #[code path]
        +cell unicode or #[code Path]
        +cell
            |  A path to a directory. Paths may be either strings or
            |  #[code Path]-like objects.

    +row("foot")
        +cell returns
        +cell #[code Vectors]
        +cell The modified #[code Vectors] object.

+h(2, "to_bytes") Vectors.to_bytes
    +tag method

p Serialize the current state to a binary string.

+aside-code("Example").
    vectors_bytes = vectors.to_bytes()

+table(["Name", "Type", "Description"])
    +row
        +cell #[code **exclude]
        +cell -
        +cell Named attributes to prevent from being serialized.

    +row("foot")
        +cell returns
        +cell bytes
        +cell The serialized form of the #[code Vectors] object.

+h(2, "from_bytes") Vectors.from_bytes
    +tag method

p Load state from a binary string.

+aside-code("Example").
    from spacy.vectors import Vectors
    vectors_bytes = vectors.to_bytes()
    new_vectors = Vectors(StringStore())
    new_vectors.from_bytes(vectors_bytes)

+table(["Name", "Type", "Description"])
    +row
        +cell #[code bytes_data]
        +cell bytes
        +cell The data to load from.

    +row
        +cell #[code **exclude]
        +cell -
        +cell Named attributes to prevent from being loaded.

    +row("foot")
        +cell returns
        +cell #[code Vectors]
        +cell The #[code Vectors] object.

+h(2, "attributes") Attributes

+table(["Name", "Type", "Description"])
    +row
        +cell #[code data]
        +cell #[code numpy.ndarray] / #[code cupy.ndarray]
        +cell
            |  Stored vectors data. #[code numpy] is used for CPU vectors,
            |  #[code cupy] for GPU vectors.

    +row
        +cell #[code key2row]
        +cell dict
        +cell
            |  Dictionary mapping word hashes to rows in the
            |  #[code Vectors.data] table.

    +row
        +cell #[code keys]
        +cell #[code numpy.ndarray]
        +cell
            |  Array keeping the keys in order, such that
            |  #[code keys[vectors.key2row[key]] == key]
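
p
    |  A minimal sketch of how these attributes fit together, assuming one
    |  300-dimensional vector has been added for #[code u'cat']:

+aside-code("Example").
    vectors = Vectors(StringStore(), 300)
    vectors.add(u'cat', numpy.random.uniform(-1, 1, (300,)))
    key = vectors.keys[0]            # hash of u'cat', in insertion order
    row = vectors.key2row[key]       # row index into vectors.data
    assert vectors.data[row].shape == (300,)
    assert vectors.keys[row] == key  # the invariant described above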

@@ -1,17 +1,22 @@
 //- 💫 DOCS > API > VOCAB

-include ../../_includes/_mixins
+include ../_includes/_mixins

 p
-    |  A lookup table that allows you to access #[code Lexeme] objects. The
-    |  #[code Vocab] instance also provides access to the #[code StringStore],
-    |  and owns underlying C-data that is shared between #[code Doc] objects.
+    |  The #[code Vocab] object provides a lookup table that allows you to
+    |  access #[+api("lexeme") #[code Lexeme]] objects, as well as the
+    |  #[+api("stringstore") #[code StringStore]]. It also owns underlying
+    |  C-data that is shared between #[code Doc] objects.

 +h(2, "init") Vocab.__init__
     +tag method

 p Create the vocabulary.

++aside-code("Example").
+    from spacy.vocab import Vocab
+    vocab = Vocab(strings=[u'hello', u'world'])
+
 +table(["Name", "Type", "Description"])
     +row
         +cell #[code lex_attr_getters]

@@ -39,7 +44,7 @@ p Create the vocabulary.
             |  A #[+api("stringstore") #[code StringStore]] that maps
             |  strings to hash values, and vice versa, or a list of strings.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Vocab]
         +cell The newly constructed object.

@@ -54,7 +59,7 @@ p Get the current number of lexemes in the vocabulary.
     assert len(nlp.vocab) > 0

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The number of lexemes in the vocabulary.

@@ -76,7 +81,7 @@ p
         +cell int / unicode
         +cell The hash value of a word, or its unicode string.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Lexeme]
         +cell The lexeme indicated by the given ID.

@@ -90,7 +95,7 @@ p Iterate over the lexemes in the vocabulary.
     stop_words = (lex for lex in nlp.vocab if lex.is_stop)

 +table(["Name", "Type", "Description"])
-    +footrow
+    +row("foot")
         +cell yields
         +cell #[code Lexeme]
         +cell An entry in the vocabulary.

@@ -115,7 +120,7 @@ p
         +cell unicode
         +cell The ID string.

-    +footrow
+    +row("foot")
         +cell returns
         +cell bool
         +cell Whether the string has an entry in the vocabulary.

@@ -152,11 +157,100 @@ p
             |  which the flag will be stored. If #[code -1], the lowest
             |  available bit will be chosen.

-    +footrow
+    +row("foot")
         +cell returns
         +cell int
         +cell The integer ID by which the flag value can be checked.

+
++h(2, "clear_vectors") Vocab.clear_vectors
+    +tag method
+    +tag-new(2)
+
+p
+    |  Drop the current vector table. Because all vectors must be the same
+    |  width, you have to call this to change the size of the vectors.
+
++aside-code("Example").
+    nlp.vocab.clear_vectors(new_dim=300)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code new_dim]
+        +cell int
+        +cell
+            |  Number of dimensions of the new vectors. If #[code None], size
+            |  is not changed.
+
++h(2, "get_vector") Vocab.get_vector
+    +tag method
+    +tag-new(2)
+
+p
+    |  Retrieve a vector for a word in the vocabulary. Words can be looked up
+    |  by string or hash value. If no vectors data is loaded, a
+    |  #[code ValueError] is raised.
+
++aside-code("Example").
+    nlp.vocab.get_vector(u'apple')
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code orth]
+        +cell int / unicode
+        +cell The hash value of a word, or its unicode string.
+
+    +row("foot")
+        +cell returns
+        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+        +cell
+            |  A word vector. Size and shape are determined by the
+            |  #[code Vocab.vectors] instance.
+
++h(2, "set_vector") Vocab.set_vector
+    +tag method
+    +tag-new(2)
+
+p
+    |  Set a vector for a word in the vocabulary. Words can be referenced
+    |  by string or hash value.
+
++aside-code("Example").
+    nlp.vocab.set_vector(u'apple', array([...]))
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code orth]
+        +cell int / unicode
+        +cell The hash value of a word, or its unicode string.
+
+    +row
+        +cell #[code vector]
+        +cell #[code.u-break numpy.ndarray[ndim=1, dtype='float32']]
+        +cell The vector to set.
+
++h(2, "has_vector") Vocab.has_vector
+    +tag method
+    +tag-new(2)
+
+p
+    |  Check whether a word has a vector. Returns #[code False] if no vectors
+    |  are loaded. Words can be looked up by string or hash value.
+
++aside-code("Example").
+    if nlp.vocab.has_vector(u'apple'):
+        vector = nlp.vocab.get_vector(u'apple')
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code orth]
+        +cell int / unicode
+        +cell The hash value of a word, or its unicode string.
+
+    +row("foot")
+        +cell returns
+        +cell bool
+        +cell Whether the word has a vector.
+
 +h(2, "to_disk") Vocab.to_disk
     +tag method
     +tag-new(2)

@@ -192,7 +286,7 @@ p Loads state from a directory. Modifies the object in place and returns it.
             |  A path to a directory. Paths may be either strings or
             |  #[code Path]-like objects.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Vocab]
         +cell The modified #[code Vocab] object.

@@ -211,7 +305,7 @@ p Serialize the current state to a binary string.
         +cell -
         +cell Named attributes to prevent from being serialized.

-    +footrow
+    +row("foot")
         +cell returns
         +cell bytes
         +cell The serialized form of the #[code Vocab] object.

@@ -238,7 +332,7 @@ p Load state from a binary string.
         +cell -
         +cell Named attributes to prevent from being loaded.

-    +footrow
+    +row("foot")
         +cell returns
         +cell #[code Vocab]
         +cell The #[code Vocab] object.

@@ -256,3 +350,14 @@ p Load state from a binary string.
         +cell #[code strings]
         +cell #[code StringStore]
         +cell A table managing the string-to-int mapping.

+    +row
+        +cell #[code vectors]
+            +tag-new(2)
+        +cell #[code Vectors]
+        +cell A table associating word IDs to word vectors.
+
+    +row
+        +cell #[code vectors_length]
+        +cell int
+        +cell Number of dimensions for each word vector.
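
//- A minimal sketch tying the new attributes together; the model name is an
//- assumption, and any model that ships word vectors would do.
+aside-code("Example").
    import spacy
    nlp = spacy.load('en_core_web_md')   # assumed model with vectors
    rows, dims = nlp.vocab.vectors.shape
    assert nlp.vocab.vectors_length == dims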

@@ -1,156 +0,0 @@
//- 💫 DOCS > API > ANNOTATION SPECS

include ../../_includes/_mixins

p This document describes the target annotations spaCy is trained to predict.

+h(2, "tokenization") Tokenization

p
    |  Tokenization standards are based on the
    |  #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
    |  The tokenizer differs from most by including tokens for significant
    |  whitespace. Any sequence of whitespace characters beyond a single space
    |  (#[code ' ']) is included as a token.

+aside-code("Example").
    from spacy.lang.en import English
    nlp = English()
    tokens = nlp('Some\nspaces  and\ttab characters')
    tokens_text = [t.text for t in tokens]
    assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and',
                           '\t', 'tab', 'characters']

p
    |  The whitespace tokens are useful for much the same reason punctuation is
    |  – it's often an important delimiter in the text. By preserving it in the
    |  token output, we are able to maintain a simple alignment between the
    |  tokens and the original string, and we ensure that no information is
    |  lost during processing.

+h(2, "sentence-boundary") Sentence boundary detection

p
    |  Sentence boundaries are calculated from the syntactic parse tree, so
    |  features such as punctuation and capitalisation play an important but
    |  non-decisive role in determining the sentence boundaries. Usually this
    |  means that the sentence boundaries will at least coincide with clause
    |  boundaries, even given poorly punctuated text.

+h(2, "pos-tagging") Part-of-speech Tagging

+aside("Tip: Understanding tags")
    |  You can also use #[code spacy.explain()] to get the description for the
    |  string representation of a tag. For example,
    |  #[code spacy.explain("RB")] will return "adverb".

include _annotation/_pos-tags

+h(2, "lemmatization") Lemmatization

p A "lemma" is the uninflected form of a word. In English, this means:

+list
    +item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
    +item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
    +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
    +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"

p
    |  The lemmatization data is taken from
    |  #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
    |  special case for pronouns: all pronouns are lemmatized to the special
    |  token #[code -PRON-].

+infobox("About spaCy's custom pronoun lemma")
    |  Unlike verbs and common nouns, there's no clear base form of a personal
    |  pronoun. Should the lemma of "me" be "I", or should we normalize person
    |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
    |  novel symbol, #[code -PRON-], which is used as the lemma for
    |  all personal pronouns.

+h(2, "dependency-parsing") Syntactic Dependency Parsing

+aside("Tip: Understanding labels")
    |  You can also use #[code spacy.explain()] to get the description for the
    |  string representation of a label. For example,
    |  #[code spacy.explain("prt")] will return "particle".

include _annotation/_dep-labels

+h(2, "named-entities") Named Entity Recognition

+aside("Tip: Understanding entity types")
    |  You can also use #[code spacy.explain()] to get the description for the
    |  string representation of an entity label. For example,
    |  #[code spacy.explain("LANGUAGE")] will return "any named language".

include _annotation/_named-entities

+h(3, "biluo") BILUO Scheme

p
    |  spaCy translates character offsets into the BILUO scheme, in order to
    |  decide the cost of each action given the current state of the entity
    |  recognizer. The costs are then used to calculate the gradient of the
    |  loss, to train the model.

+aside("Why BILUO, not IOB?")
    |  There are several coding schemes for encoding entity annotations as
    |  token tags. These coding schemes are equally expressive, but not
    |  necessarily equally learnable.
    |  #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
    |  showed that the minimal #[strong Begin], #[strong In], #[strong Out]
    |  scheme was more difficult to learn than the #[strong BILUO] scheme that
    |  we use, which explicitly marks boundary tokens.

+table([ "Tag", "Description" ])
    +row
        +cell #[code #[span.u-color-theme B] EGIN]
        +cell The first token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme I] N]
        +cell An inner token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme L] AST]
        +cell The final token of a multi-token entity.

    +row
        +cell #[code #[span.u-color-theme U] NIT]
        +cell A single-token entity.

    +row
        +cell #[code #[span.u-color-theme O] UT]
        +cell A non-entity token.

+h(2, "json-input") JSON input format for training

p
    |  spaCy takes training data in the following format:

+code("Example structure").
    doc: {
        id: string,
        paragraphs: [{
            raw: string,
            sents: [int],
            tokens: [{
                start: int,
                tag: string,
                head: int,
                dep: string
            }],
            ner: [{
                start: int,
                end: int,
                label: string
            }],
            brackets: [{
                start: int,
                end: int,
                label: string
            }]
        }]
    }

@@ -1,111 +0,0 @@
//- 💫 DOCS > API > DEPENDENCYPARSER

include ../../_includes/_mixins

p Annotate syntactic dependencies on #[code Doc] objects.

+under-construction

+h(2, "init") DependencyParser.__init__
    +tag method

p Create a #[code DependencyParser].

+table(["Name", "Type", "Description"])
    +row
        +cell #[code vocab]
        +cell #[code Vocab]
        +cell The vocabulary. Must be shared with documents to be processed.

    +row
        +cell #[code model]
        +cell #[code thinc.linear.AveragedPerceptron]
        +cell The statistical model.

    +footrow
        +cell returns
        +cell #[code DependencyParser]
        +cell The newly constructed object.

+h(2, "call") DependencyParser.__call__
    +tag method

p
    |  Apply the dependency parser, setting the heads and dependency relations
    |  onto the #[code Doc] object.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document to be processed.

    +footrow
        +cell returns
        +cell #[code None]
        +cell -

+h(2, "pipe") DependencyParser.pipe
    +tag method

p Process a stream of documents.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code stream]
        +cell -
        +cell The sequence of documents to process.

    +row
        +cell #[code batch_size]
        +cell int
        +cell The number of documents to accumulate into a working set.

    +row
        +cell #[code n_threads]
        +cell int
        +cell
            |  The number of threads with which to work on the buffer in
            |  parallel.

    +footrow
        +cell yields
        +cell #[code Doc]
        +cell Documents, in order.

+h(2, "update") DependencyParser.update
    +tag method

p Update the statistical model.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The example document for the update.

    +row
        +cell #[code gold]
        +cell #[code GoldParse]
        +cell The gold-standard annotations, to calculate the loss.

    +footrow
        +cell returns
        +cell int
        +cell The loss on this example.

+h(2, "step_through") DependencyParser.step_through
    +tag method

p Set up a stepwise state, to introspect and control the transition sequence.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document to step through.

    +footrow
        +cell returns
        +cell #[code StepwiseState]
        +cell A state object, to step through the annotation process.

@@ -1,109 +0,0 @@
//- 💫 DOCS > API > ENTITYRECOGNIZER

include ../../_includes/_mixins

p Annotate named entities on #[code Doc] objects.

+under-construction

+h(2, "init") EntityRecognizer.__init__
    +tag method

p Create an #[code EntityRecognizer].

+table(["Name", "Type", "Description"])
    +row
        +cell #[code vocab]
        +cell #[code Vocab]
        +cell The vocabulary. Must be shared with documents to be processed.

    +row
        +cell #[code model]
        +cell #[code thinc.linear.AveragedPerceptron]
        +cell The statistical model.

    +footrow
        +cell returns
        +cell #[code EntityRecognizer]
        +cell The newly constructed object.

+h(2, "call") EntityRecognizer.__call__
    +tag method

p Apply the entity recognizer, setting the NER tags onto the #[code Doc] object.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document to be processed.

    +footrow
        +cell returns
        +cell #[code None]
        +cell -

+h(2, "pipe") EntityRecognizer.pipe
    +tag method

p Process a stream of documents.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code stream]
        +cell -
        +cell The sequence of documents to process.

    +row
        +cell #[code batch_size]
        +cell int
        +cell The number of documents to accumulate into a working set.

    +row
        +cell #[code n_threads]
        +cell int
        +cell
            |  The number of threads with which to work on the buffer in
            |  parallel.

    +footrow
        +cell yields
        +cell #[code Doc]
        +cell Documents, in order.

+h(2, "update") EntityRecognizer.update
    +tag method

p Update the statistical model.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The example document for the update.

    +row
        +cell #[code gold]
        +cell #[code GoldParse]
        +cell The gold-standard annotations, to calculate the loss.

    +footrow
        +cell returns
        +cell int
        +cell The loss on this example.

+h(2, "step_through") EntityRecognizer.step_through
    +tag method

p Set up a stepwise state, to introspect and control the transition sequence.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document to step through.

    +footrow
        +cell returns
        +cell #[code StepwiseState]
        +cell A state object, to step through the annotation process.
 | 
					 | 
				
			||||||
| 
						 | 
					@ -1,241 +0,0 @@
 | 
				
			||||||
//- 💫 DOCS > API > FACTS & FIGURES
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
include ../../_includes/_mixins
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+under-construction
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+h(2, "comparison") Feature comparison
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
p
 | 
					 | 
				
			||||||
    |  Here's a quick comparison of the functionalities offered by spaCy,
 | 
					 | 
				
			||||||
    |  #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet],
 | 
					 | 
				
			||||||
    |  #[+a("http://www.nltk.org/py-modindex.html") NLTK] and
 | 
					 | 
				
			||||||
    |  #[+a("http://stanfordnlp.github.io/CoreNLP/") CoreNLP].
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+table([ "", "spaCy", "SyntaxNet", "NLTK", "CoreNLP"])
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Easy installation
 | 
					 | 
				
			||||||
        each icon in [ "pro", "con", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Python API
 | 
					 | 
				
			||||||
        each icon in [ "pro", "con", "pro", "con" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Multi-language support
 | 
					 | 
				
			||||||
        each icon in [ "neutral", "pro", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Tokenization
 | 
					 | 
				
			||||||
        each icon in [ "pro", "pro", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Part-of-speech tagging
 | 
					 | 
				
			||||||
        each icon in [ "pro", "pro", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Sentence segmentation
 | 
					 | 
				
			||||||
        each icon in [ "pro", "pro", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Dependency parsing
 | 
					 | 
				
			||||||
        each icon in [ "pro", "pro", "con", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Entity Recognition
 | 
					 | 
				
			||||||
        each icon in [ "pro", "con", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Integrated word vectors
 | 
					 | 
				
			||||||
        each icon in [ "pro", "con", "con", "con" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Sentiment analysis
 | 
					 | 
				
			||||||
        each icon in [ "pro", "con", "pro", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
    +row
 | 
					 | 
				
			||||||
        +cell Coreference resolution
 | 
					 | 
				
			||||||
        each icon in [ "con", "con", "con", "pro" ]
 | 
					 | 
				
			||||||
            +cell.u-text-center #[+procon(icon)]
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
+h(2, "benchmarks") Benchmarks

p
    |  Two peer-reviewed papers in 2015 confirm that spaCy offers the
    |  #[strong fastest syntactic parser in the world] and that
    |  #[strong its accuracy is within 1% of the best] available. The few
    |  systems that are more accurate are 20× slower or more.

+aside("About the evaluation")
    |  The first of the evaluations was published by #[strong Yahoo! Labs] and
    |  #[strong Emory University], as part of a survey of current parsing
    |  technologies #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") (Choi et al., 2015)].
    |  Their results and subsequent discussions helped us develop a novel
    |  psychologically-motivated technique to improve spaCy's accuracy, which
    |  we published in joint work with Macquarie University
    |  #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].

+table([ "System", "Language", "Accuracy", "Speed (wps)" ])
    +row
        each data in [ "spaCy", "Cython", "91.8", "13,963" ]
            +cell #[strong=data]

    +row
        each data in [ "ClearNLP", "Java", "91.7", "10,271" ]
            +cell=data

    +row
        each data in [ "CoreNLP", "Java", "89.6", "8,602" ]
            +cell=data

    +row
        each data in [ "MATE", "Java", "92.5", "550" ]
            +cell=data

    +row
        each data in [ "Turbo", "C++", "92.4", "349" ]
            +cell=data

+h(3, "parse-accuracy") Parse accuracy

p
    |  In 2016, Google released their
    |  #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet]
    |  library, setting a new state of the art for syntactic dependency parsing
    |  accuracy. SyntaxNet's algorithm is very similar to spaCy's. The main
    |  difference is that SyntaxNet uses a neural network while spaCy uses a
    |  sparse linear model.

+aside("Methodology")
    |  #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)] chose
    |  slightly different experimental conditions from
    |  #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al. (2015)],
    |  so the two accuracy tables here do not present directly comparable
    |  figures. We have only evaluated spaCy in the "News" condition following
    |  the SyntaxNet methodology. We don't yet have benchmark figures for the
    |  "Web" and "Questions" conditions.

+table([ "System", "News", "Web", "Questions" ])
    +row
        +cell spaCy
        each data in [ 92.8, "n/a", "n/a" ]
            +cell=data

    +row
        +cell #[+a("https://github.com/tensorflow/models/tree/master/syntaxnet") Parsey McParseface]
        each data in [ 94.15, 89.08, 94.77 ]
            +cell=data

    +row
        +cell #[+a("http://www.cs.cmu.edu/~ark/TurboParser/") Martins et al. (2013)]
        each data in [ 93.10, 88.23, 94.21 ]
            +cell=data

    +row
        +cell #[+a("http://research.google.com/pubs/archive/38148.pdf") Zhang and McDonald (2014)]
        each data in [ 93.32, 88.65, 93.37 ]
            +cell=data

    +row
        +cell #[+a("http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf") Weiss et al. (2015)]
        each data in [ 93.91, 89.29, 94.17 ]
            +cell=data

    +row
        +cell #[strong #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)]]
        each data in [ 94.44, 90.17, 95.40 ]
            +cell #[strong=data]

+h(3, "speed-comparison") Detailed speed comparison

p
    |  Here we compare the per-document processing time of various spaCy
    |  functionalities against other NLP libraries. We show both absolute
    |  timings (in ms) and relative performance (normalized to spaCy). Lower is
    |  better.

+aside("Methodology")
    |  #[strong Setup:] 100,000 plain-text documents were streamed from an
    |  SQLite3 database and processed with an NLP library to one of three
    |  levels of detail: tokenization, tagging, or parsing. The tasks are
    |  additive: to parse the text you have to tokenize and tag it. The
    |  pre-processing was not subtracted from the times; we report the time
    |  required for the pipeline to complete, as mean times per document,
    |  in milliseconds.#[br]#[br]
    |  #[strong Hardware]: Intel i7-3770 (2012)#[br]
    |  #[strong Implementation]: #[+src(gh("spacy-benchmarks")) spacy-benchmarks]

+table
    +row.u-text-label.u-text-center
        th.c-table__head-cell
        th.c-table__head-cell(colspan="3") Absolute (ms per doc)
        th.c-table__head-cell(colspan="3") Relative (to spaCy)

    +row
        each column in ["System", "Tokenize", "Tag", "Parse", "Tokenize", "Tag", "Parse"]
            th.c-table__head-cell.u-text-label=column

    +row
        +cell #[strong spaCy]
        each data in [ "0.2ms", "1ms", "19ms" ]
            +cell #[strong=data]

        each data in [ "1x", "1x", "1x" ]
            +cell=data

    +row
        each data in [ "CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x" ]
            +cell=data

    +row
        each data in [ "ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x" ]
            +cell=data

    +row
        each data in [ "NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a" ]
            +cell=data

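
p
    |  For orientation, the harness behind these numbers is conceptually just a
    |  timing loop over #[code nlp.pipe]. The sketch below illustrates the idea
    |  only; the database path and schema are placeholders, and the real code
    |  lives in #[+src(gh("spacy-benchmarks")) spacy-benchmarks]:

+code("Timing loop (illustrative sketch)").
    import sqlite3
    import time
    import spacy

    nlp = spacy.load('en')
    db = sqlite3.connect('documents.db')  # placeholder path
    texts = [row[0] for row in db.execute('SELECT body FROM docs')]  # placeholder schema

    start = time.time()
    for doc in nlp.pipe(texts, batch_size=1000):  # tokenize + tag + parse
        pass
    print('mean ms per doc:', (time.time() - start) * 1000 / len(texts))
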
+h(3, "ner") Named entity comparison

p
    |  #[+a("https://aclweb.org/anthology/W/W16/W16-2703.pdf") Jiang et al. (2016)]
    |  present several detailed comparisons of the named entity recognition
    |  models provided by spaCy, CoreNLP, NLTK and LingPipe. Here we show their
    |  evaluation of person, location and organization accuracy on Wikipedia.

+aside("Methodology")
    |  Making a meaningful comparison of different named entity recognition
    |  systems is tricky. Systems are often trained on different data, which
    |  usually have slight differences in annotation style. For instance, some
    |  corpora include titles as part of person names, while others don't.
    |  These trivial differences in convention can distort comparisons
    |  significantly. Jiang et al.'s #[em partial overlap] metric goes a long
    |  way to solving this problem.

+table([ "System", "Precision", "Recall", "F-measure" ])
    +row
        +cell spaCy
        each data in [ 0.7240, 0.6514, 0.6858 ]
            +cell=data

    +row
        +cell #[strong CoreNLP]
        each data in [ 0.7914, 0.7327, 0.7609 ]
            +cell #[strong=data]

    +row
        +cell NLTK
        each data in [ 0.5136, 0.6532, 0.5750 ]
            +cell=data

    +row
        +cell LingPipe
        each data in [ 0.5412, 0.5357, 0.5384 ]
            +cell=data
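
p
    |  The intuition behind the #[em partial overlap] metric is easy to sketch.
    |  The snippet below is an illustration of the idea only, not Jiang et
    |  al.'s exact scoring code; entities are assumed to be
    |  #[code (start, end, label)] character spans:

+code("Partial-overlap matching (sketch)").
    def overlaps(pred, gold):
        # Same label and intersecting character ranges count as a match.
        return pred[2] == gold[2] and pred[0] < gold[1] and gold[0] < pred[1]

    def partial_f1(pred_ents, gold_ents):
        tp_pred = sum(1 for p in pred_ents if any(overlaps(p, g) for g in gold_ents))
        tp_gold = sum(1 for g in gold_ents if any(overlaps(p, g) for p in pred_ents))
        p = tp_pred / float(len(pred_ents) or 1)
        r = tp_gold / float(len(gold_ents) or 1)
        return 2 * p * r / ((p + r) or 1)
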

@ -1,93 +0,0 @@
//- 💫 DOCS > API > LANGUAGE MODELS

include ../../_includes/_mixins

p
    |  spaCy currently provides models for the following languages and
    |  capabilities:

+aside-code("Download language models", "bash").
    spacy download en
    spacy download de
    spacy download fr

+table([ "Language", "Token", "SBD", "Lemma", "POS", "NER", "Dep", "Vector", "Sentiment" ])
    +row
        +cell English #[code en]
        each icon in [ "pro", "pro", "pro", "pro", "pro", "pro", "pro", "con" ]
            +cell.u-text-center #[+procon(icon)]

    +row
        +cell German #[code de]
        each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
            +cell.u-text-center #[+procon(icon)]

    +row
        +cell French #[code fr]
        each icon in [ "pro", "con", "con", "pro", "con", "pro", "pro", "con" ]
            +cell.u-text-center #[+procon(icon)]

    +row
        +cell Spanish #[code es]
        each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
            +cell.u-text-center #[+procon(icon)]

p
    +button("/docs/usage/models", true, "primary") See available models

+h(2, "alpha-support") Alpha tokenization support

p
    |  Work has started on the following languages. You can help by
    |  #[+a("/docs/usage/adding-languages#language-data") improving the existing language data]
    |  and extending the tokenization patterns.

+aside("Usage note")
    |  Note that the alpha languages don't yet come with a language model. In
    |  order to use them, you have to import them directly:

    +code.o-no-block.
        from spacy.lang.fi import Finnish
        nlp = Finnish()
        doc = nlp(u'Ilmatyynyalukseni on täynnä ankeriaita')

+infobox("Dependencies")
    |  Some language tokenizers require external dependencies. To use #[strong Chinese],
    |  you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
    |  The #[strong Japanese] tokenizer requires
    |  #[+a("https://github.com/mocobeta/janome") Janome].

+table([ "Language", "Code", "Source" ])
    each language, code in { it: "Italian", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", nb: "Norwegian Bokmål", da: "Danish", hu: "Hungarian", pl: "Polish", bn: "Bengali", he: "Hebrew", zh: "Chinese", ja: "Japanese" }
        +row
            +cell #{language}
            +cell #[code=code]
            +cell
                +src(gh("spaCy", "spacy/lang/" + code)) lang/#{code}

+h(2, "multi-language") Multi-language support
    +tag-new(2)

p
    |  As of v2.0, spaCy supports models trained on more than one language. This
    |  is especially useful for named entity recognition. The language ID used
    |  for multi-language or language-neutral models is #[code xx]. The
    |  language class, a generic subclass containing only the base language data,
    |  can be found in #[+src(gh("spaCy", "spacy/lang/xx")) lang/xx].

p
    |  To load your model with the neutral, multi-language class, simply set
    |  #[code "language": "xx"] in your
    |  #[+a("/docs/usage/saving-loading#models-generating") model package]'s
    |  meta.json. You can also import the class directly, or call
    |  #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
    |  lazy-loading.

+code("Standard import").
    from spacy.lang.xx import MultiLanguage
    nlp = MultiLanguage()

+code("With lazy-loading").
    from spacy.util import get_lang_class
    nlp = get_lang_class('xx')

@ -1,93 +0,0 @@
//- 💫 DOCS > API > TAGGER

include ../../_includes/_mixins

p Annotate part-of-speech tags on #[code Doc] objects.

+under-construction

+h(2, "init") Tagger.__init__
    +tag method

p Create a #[code Tagger].

+table(["Name", "Type", "Description"])
    +row
        +cell #[code vocab]
        +cell #[code Vocab]
        +cell The vocabulary. Must be shared with documents to be processed.

    +row
        +cell #[code model]
        +cell #[code thinc.linear.AveragedPerceptron]
        +cell The statistical model.

    +footrow
        +cell returns
        +cell #[code Tagger]
        +cell The newly constructed object.

+h(2, "call") Tagger.__call__
    +tag method

p Apply the tagger, setting the POS tags onto the #[code Doc] object.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The tokens to be tagged.

    +footrow
        +cell returns
        +cell #[code None]
        +cell -
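
p
    |  A usage sketch, assuming #[code nlp] is a loaded pipeline and
    |  #[code tagger] is its tagger component (how you obtain the component
    |  depends on your spaCy version):

+code.
    doc = nlp.tokenizer(u'Apply the tagger to raw tokens')
    tagger(doc)  # returns None; token.tag_ and token.pos_ are set in place
    print([(token.text, token.tag_) for token in doc])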

+h(2, "pipe") Tagger.pipe
    +tag method

p Tag a stream of documents.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code stream]
        +cell -
        +cell The sequence of documents to tag.

    +row
        +cell #[code batch_size]
        +cell int
        +cell The number of documents to accumulate into a working set.

    +row
        +cell #[code n_threads]
        +cell int
        +cell
            |  The number of threads with which to work on the buffer in
            |  parallel.

    +footrow
        +cell yields
        +cell #[code Doc]
        +cell Documents, in order.
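
p
    |  A usage sketch, again assuming #[code tagger] is the pipeline's tagger
    |  component. #[code pipe] is useful for bulk processing, and yields the
    |  tagged documents in order:

+code.
    docs = (nlp.tokenizer(text) for text in texts)  # texts: an iterable of unicode
    for doc in tagger.pipe(docs, batch_size=50, n_threads=4):
        assert doc[0].tag_  # each document comes back tagged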

+h(2, "update") Tagger.update
    +tag method

p Update the statistical model, with tags supplied for the given document.

+table(["Name", "Type", "Description"])
    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The example document for the update.

    +row
        +cell #[code gold]
        +cell #[code GoldParse]
        +cell Manager for the gold-standard tags.

    +footrow
        +cell returns
        +cell int
        +cell Number of tags predicted correctly.
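
p
    |  A training-loop sketch built from the signature above; the training
    |  data format and the surrounding objects are assumed:

+code.
    from spacy.gold import GoldParse

    for raw_text, tag_seq in train_data:  # assumed (text, tags) pairs
        doc = nlp.tokenizer(raw_text)
        gold = GoldParse(doc, tags=tag_seq)
        n_correct = tagger.update(doc, gold)  # number of tags predicted correctly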

@ -1,7 +0,0 @@
//- 💫 DOCS > API > TENSORIZER

include ../../_includes/_mixins

p Add a tensor with position-sensitive meaning representations to a #[code Doc].

+under-construction

@ -1,21 +0,0 @@
//- 💫 DOCS > API > TEXTCATEGORIZER

include ../../_includes/_mixins

p
    |  Add text categorization models to spaCy pipelines. The model supports
    |  classification with multiple, non-mutually exclusive labels.

p
    |  You can change the model architecture rather easily, but by default, the
    |  #[code TextCategorizer] class uses a convolutional neural network to
    |  assign position-sensitive vectors to each word in the document. This step
    |  is similar to the #[+api("tensorizer") #[code Tensorizer]] component, but the
    |  #[code TextCategorizer] uses its own CNN model, to avoid sharing weights
    |  with the other pipeline components. The document tensor is then
    |  summarized by concatenating max and mean pooling, and a multilayer
    |  perceptron is used to predict an output vector of length #[code nr_class],
    |  before a logistic activation is applied elementwise. The value of each
    |  output neuron is the probability that some class is present.
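
p
    |  The summarization step is easy to sketch with NumPy. The snippet below
    |  only illustrates the shapes involved, not the actual thinc
    |  implementation; the weights are random placeholders and a single linear
    |  layer stands in for the multilayer perceptron:

+code("Pooling and prediction (sketch)").
    import numpy

    nr_class, width, nr_token = 5, 64, 30
    tensor = numpy.random.randn(nr_token, width)  # one row per token, from the CNN
    pooled = numpy.concatenate([tensor.max(axis=0), tensor.mean(axis=0)])  # (2 * width,)
    W = numpy.random.randn(nr_class, 2 * width)   # placeholder weights
    b = numpy.zeros(nr_class)
    scores = 1. / (1. + numpy.exp(-(W.dot(pooled) + b)))  # elementwise logistic
    # scores[i] is the probability that class i is present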

+under-construction

@ -1,7 +0,0 @@
//- 💫 DOCS > API > VECTORS

include ../../_includes/_mixins

p A container class for vector data keyed by string.

+under-construction
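
p
    |  Conceptually, the class behaves like a mapping from strings to
    |  fixed-width arrays. A minimal sketch of the idea, not the actual API,
    |  which is still under construction:

+code.
    import numpy

    vectors = {}  # string -> vector, as a plain dict
    vectors[u'apple'] = numpy.zeros((300,), dtype='float32')
    assert vectors[u'apple'].shape == (300,)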

72 website/usage/_models/_languages.jade Normal file
@ -0,0 +1,72 @@
//- 💫 DOCS > USAGE > MODELS > LANGUAGE SUPPORT

p spaCy currently provides models for the following languages:

+table(["Language", "Code", "Language data", "Models"])
    for models, code in MODELS
        - var count = Object.keys(models).length
        +row
            +cell=LANGUAGES[code]
            +cell #[code=code]
            +cell
                +src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]
            +cell
                +a("/models/" + code) #{count} #{(count == 1) ? "model" : "models"}

+h(3, "alpha-support") Alpha tokenization support

p
    |  Work has started on the following languages. You can help by
    |  #[+a("/usage/adding-languages#language-data") improving the existing language data]
    |  and extending the tokenization patterns.

+aside("Usage note")
    |  Note that the alpha languages don't yet come with a language model. In
    |  order to use them, you have to import them directly, or use
    |  #[+api("spacy#blank") #[code spacy.blank]]:

    +code.o-no-block.
        import spacy
        from spacy.lang.fi import Finnish
        nlp = Finnish()  # use directly
        nlp = spacy.blank('fi')  # blank instance

+table(["Language", "Code", "Language data"])
    for lang, code in LANGUAGES
        if !Object.keys(MODELS).includes(code)
            +row
                +cell #{LANGUAGES[code]}
                +cell #[code=code]
                +cell
                    +src(gh("spaCy", "spacy/lang/" + code)) #[code lang/#{code}]

+infobox("Dependencies")
    |  Some language tokenizers require external dependencies. To use #[strong Chinese],
    |  you need to have #[+a("https://github.com/fxsjy/jieba") Jieba] installed.
    |  The #[strong Japanese] tokenizer requires
    |  #[+a("https://github.com/mocobeta/janome") Janome].

+h(3, "multi-language") Multi-language support
    +tag-new(2)

p
    |  As of v2.0, spaCy supports models trained on more than one language. This
    |  is especially useful for named entity recognition. The language ID used
    |  for multi-language or language-neutral models is #[code xx]. The
    |  language class, a generic subclass containing only the base language data,
    |  can be found in #[+src(gh("spaCy", "spacy/lang/xx")) #[code lang/xx]].

p
    |  To load your model with the neutral, multi-language class, simply set
    |  #[code "language": "xx"] in your
    |  #[+a("/usage/training#models-generating") model package]'s
    |  meta.json. You can also import the class directly, or call
    |  #[+api("util#get_lang_class") #[code util.get_lang_class()]] for
    |  lazy-loading.

+code("Standard import").
    from spacy.lang.xx import MultiLanguage
    nlp = MultiLanguage()

+code("With lazy-loading").
    from spacy.util import get_lang_class
    nlp = get_lang_class('xx')