* Add edit tree lemmatizer
  Co-authored-by: Daniël de Kok <me@danieldk.eu>
* Hide edit tree lemmatizer labels
* Use relative imports
* Switch to single quotes in error message
* Type annotation fixes
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Reformat edit_tree_lemmatizer with black
* EditTreeLemmatizer.predict: take Iterable
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Validate edit trees during deserialization
  This change also changes the serialized representation. Rather than mirroring the deep C structure, we use a simple flat union of the match and substitution node types.
* Move edit_trees to _edit_tree_internals
* Fix invalid edit tree format error message
* edit_tree_lemmatizer: remove outdated TODO comment
* Rename factory name to trainable_lemmatizer
* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14
* Switch to Tagger.v2
* Add documentation for EditTreeLemmatizer
* docs: Fix 3.2 -> 3.3 somewhere
* trainable_lemmatizer documentation fixes
* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
	The central data structures in spaCy are the Language class,
the Vocab and the Doc object. The Language class
is used to process a text and turn it into a Doc object. It's typically stored
as a variable called nlp. The Doc object owns the sequence of tokens and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the Vocab, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a single source of truth.
Text annotations are also designed to allow a single source of truth: the Doc
object owns the data, and Span and Token are
views that point into it. The Doc object is constructed by the
Tokenizer, and then modified in place by the components
of the pipeline. The Language object coordinates these components. It takes
raw text and sends it through the pipeline, returning an annotated document.
It also orchestrates training and serialization.
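The ownership model described above can be illustrated with a small toy model. This is only a sketch in plain Python, not spaCy's implementation: the names `ToyVocab`, `ToyDoc`, `ToyToken` and `ToySpan` are hypothetical, and the toy vocabulary hands out sequential integer IDs, whereas spaCy maps strings to hash values.

```python
class ToyVocab:
    """Hypothetical shared store: every distinct string is kept exactly once."""

    def __init__(self):
        self.strings = []  # ID -> string
        self.ids = {}      # string -> ID

    def add(self, text):
        # Reuse the existing ID when the string has been seen before.
        if text not in self.ids:
            self.ids[text] = len(self.strings)
            self.strings.append(text)
        return self.ids[text]


class ToyDoc:
    """Owns the token data; tokens and spans only point back into it."""

    def __init__(self, vocab, words):
        self.vocab = vocab
        self.word_ids = [vocab.add(word) for word in words]
        self.tags = [None] * len(words)

    def __len__(self):
        return len(self.word_ids)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, _ = key.indices(len(self))
            return ToySpan(self, start, stop)
        # range(...)[key] normalizes negative indices and checks bounds.
        return ToyToken(self, range(len(self))[key])


class ToyToken:
    """A view: holds a reference to the doc plus an index, no data of its own."""

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def text(self):
        return self.doc.vocab.strings[self.doc.word_ids[self.i]]

    @property
    def tag(self):
        return self.doc.tags[self.i]


class ToySpan:
    """A view over doc[start:end]."""

    def __init__(self, doc, start, end):
        self.doc = doc
        self.start = start
        self.end = end

    def __getitem__(self, i):
        return self.doc[range(self.start, self.end)[i]]


vocab = ToyVocab()
doc = ToyDoc(vocab, "the cat saw the dog".split())
span = doc[1:3]                      # covers "cat saw"
doc.tags[1] = "NOUN"                 # write to the doc ...
print(span[0].text, span[0].tag)     # ... and the span view sees the change
print(len(doc), len(vocab.strings))  # five tokens, but "the" is stored once
```

Because the views store only a reference and an offset, an annotation written to the document is immediately visible through every token and span that points at it, and repeated strings cost no extra storage.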
Container objects
| Name | Description | 
|---|---|
| Doc | A container for accessing linguistic annotations. | 
| DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
| Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
| Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. | 
| Span | A slice from a Doc object. |
| SpanGroup | A named collection of spans belonging to a Doc. | 
| Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc. | 
Processing pipeline
The processing pipeline consists of one or more pipeline components that are
called on the Doc in order. The tokenizer runs before the components. Pipeline
components can be added using Language.add_pipe.
They can contain a statistical model and trained weights, or only make
rule-based modifications to the Doc. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
custom components.
| Name | Description | 
|---|---|
| AttributeRuler | Set token attributes using matcher rules. | 
| DependencyParser | Predict syntactic dependencies. | 
| EditTreeLemmatizer | Predict base forms of words. | 
| EntityLinker | Disambiguate named entities to nodes in a knowledge base. | 
| EntityRecognizer | Predict named entities, e.g. persons or products. | 
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Lemmatizer | Determine the base forms of words using rules and lookups. | 
| Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. | 
| SentenceRecognizer | Predict sentence boundaries. | 
| Sentencizer | Implement rule-based sentence boundary detection that doesn't require the dependency parse. | 
| Tagger | Predict part-of-speech tags. | 
| TextCategorizer | Predict categories or labels over the whole document. | 
| Tok2Vec | Apply a "token-to-vector" model and set its outputs. | 
| Tokenizer | Segment raw text and create Doc objects from the words. |
| TrainablePipe | Class that all trainable pipeline components inherit from. | 
| Transformer | Use a transformer model and set its outputs. | 
| Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. | 
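The control flow of the processing pipeline (tokenizer first, then each component in order, each one modifying the same document) can be sketched in plain Python. This is a simplified model under stated assumptions, not spaCy's code: `ToyLanguage` and `add_component` are hypothetical names, a dict stands in for the Doc, and the real Language.add_pipe has a different signature.

```python
class ToyLanguage:
    """Hypothetical coordinator: tokenizer first, then each component in order."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.components = []  # (name, callable) pairs, in execution order

    def add_component(self, name, func):
        # Stand-in for Language.add_pipe; the real method differs.
        self.components.append((name, func))

    def __call__(self, text):
        doc = self.tokenizer(text)  # the tokenizer creates the document
        for _name, func in self.components:
            doc = func(doc)         # each component receives and returns it
        return doc


def whitespace_tokenizer(text):
    # A plain dict stands in for the Doc object.
    return {"words": text.split()}


def mark_sentence_starts(doc):
    # Rule-based component: a token starts a sentence if it is the first
    # token or if it follows a full stop.
    words = doc["words"]
    doc["sent_starts"] = [i == 0 or words[i - 1] == "." for i in range(len(words))]
    return doc


def count_sentences(doc):
    # Depends on the annotation set by the previous component.
    doc["n_sents"] = sum(doc["sent_starts"])
    return doc


nlp = ToyLanguage(whitespace_tokenizer)
nlp.add_component("sentence_starts", mark_sentence_starts)
nlp.add_component("sentence_counter", count_sentences)
doc = nlp("Hello world . This is text .")
print(doc["sent_starts"])
print(doc["n_sents"])
```

The second component reads what the first one wrote, which is why order matters: registering the two in the opposite order would fail in this sketch because `sent_starts` would not exist yet.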
Matchers
Matchers help you find and extract information from Doc objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a Doc and gives you access to the matched tokens in context.
| Name | Description | 
|---|---|
| DependencyMatcher | Match sequences of tokens based on dependency trees using Semgrex operators. | 
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. | 
| PhraseMatcher | Match sequences of tokens based on phrases. | 
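To make the idea of token-based match patterns concrete, here is a minimal sketch of matching a sequence of per-token attribute constraints against a list of words. It is a toy under stated assumptions, not the Matcher's implementation: `token_attrs`, `token_matches` and `find_matches` are hypothetical helpers, only three attributes are modelled, and there are no operators or quantifiers. It returns plain `(start, end)` offsets, whereas the real Matcher also reports a match ID.

```python
def token_attrs(word):
    # Hypothetical attribute set, loosely modelled on token attributes.
    return {"TEXT": word, "LOWER": word.lower(), "IS_DIGIT": word.isdigit()}


def token_matches(word, spec):
    # An empty spec places no constraints, so it matches any token.
    attrs = token_attrs(word)
    return all(attrs.get(key) == value for key, value in spec.items())


def find_matches(words, pattern):
    """Return (start, end) offsets of every window that satisfies the pattern."""
    size = len(pattern)
    if size == 0:
        return []
    matches = []
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if all(token_matches(w, spec) for w, spec in zip(window, pattern)):
            matches.append((start, start + size))
    return matches


words = "I bought 3 apples and 12 pears".split()
pattern = [{"IS_DIGIT": True}, {}]  # a number followed by any token
for start, end in find_matches(words, pattern):
    print(start, end, words[start:end])
```

Returning offsets instead of copied strings keeps each match tied to its position in the document, so the surrounding tokens remain accessible.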
Other classes
| Name | Description | 
|---|---|
| Corpus | Class for managing annotated corpora for training and evaluation data. | 
| KnowledgeBase | Storage for entities and aliases of a knowledge base for entity linking. | 
| Lookups | Container for convenient access to large lookup tables and dictionaries. | 
| MorphAnalysis | A morphological analysis. | 
| Morphology | Store morphological analyses and map them to and from hash values. | 
| Scorer | Compute evaluation scores. | 
| StringStore | Map strings to and from hash values. | 
| Vectors | Container class for vector data keyed by string. | 
| Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |